M100 21 THE OPEN UNIVERSITY
Mathematics Foundation Course Unit 21 


Probability and Statistics III




PROBABILITY AND STATISTICS III 


Prepared by the Mathematics Foundation Course Team 


Correspondence Text 21 


The Open University Press 


Professor Frank Downton acted as consultant for this unit. 


The Open University Press 
Walton Hall, Bletchley, Buckinghamshire 


First published 1971. Reprinted 1972 
Copyright © 1971 The Open University 


All rights reserved 

No part of this work may be 
reproduced in any form, by 
mimeograph or any other means, 
without permission in writing from 
the publishers 


Designed by the Media Development Group of the Open University 


Printed in Great Britain by 
EYRE AND SPOTTISWOODE LIMITED 
AT THE GROSVENOR PRESS PORTSMOUTH 


SBN 335 01020 2 


Open University courses provide a method of study for independent 
learners through an integrated teaching system, including textual material, 
radio and television programmes and short residential courses. This text 
is one of a series that makes up the correspondence element of the
Foundation Course.


The Open University’s courses represent a new system of university-level 
education. Much of the teaching material is still in a developmental stage. 
Courses and course materials are, therefore, kept continually under 
revision. It is intended to issue regular up-dating notes as and when the 
need arises, and new editions will be brought out when necessary. 


For general availability of supporting material referred to in this book, 
please write to the Director of Marketing, The Open University, Walton 
Hall, Bletchley, Buckinghamshire. 


Further information on Open University courses may be obtained from 
The Admissions Office, The Open University, P.O. Box 48, Bletchley, 
Buckinghamshire. 


Contents

Objectives
Structural Diagram
Glossary
Notation
Bibliography

21.0 Introduction

21.1 Random Variables and Distributions
21.1.0 Introduction
21.1.1 The Choice of a Sample Space
21.1.2 Random Variables
21.1.3 Probability Distributions
21.1.4 Simple Properties of Probability Distributions

21.2 Sampling and its Consequences
21.2.1 Random Sampling
21.2.2 Sampling Statistics and Their Distributions
21.2.3 Parametric Models and Statistics as Estimates

21.3 The Binomial Distribution
21.3.0 Introduction
21.3.1 Deriving the Distribution
21.3.2 Estimating an Unknown Probability


Objectives 


The general aim of this unit is to introduce random variables, probability
distributions, and sampling, all of which are fundamental to statistical
work.


After working through this unit you should be able to:

(i) select the most suitable sample space in any given (simple) situation;

(ii) explain what is meant by a random variable and its probability
distribution;

(iii) explain what is meant by the expectation of a random variable, and
calculate the expectation in simple cases;

(iv) derive the variance of a simple discrete distribution;

(v) compare two distributions in terms of their means and variances;

(vi) explain what is meant by a sampling distribution, and distinguish it
from the parent probability distribution;

(vii) use sampling statistics to estimate unknown parameters, and
compare the suitability of different statistics for this purpose in
simple cases;

(viii) derive the theoretical form of the binomial distribution as a particular
case of a sampling distribution;

(ix) find the mean and variance of the binomial distribution.


Note 


Before working through this correspondence text, make sure you have 
read the general introduction to the mathematics course in the Study 
Guide, as this explains the philosophy underlying the whole course. 
You should also be familiar with the section which explains how a text is 
constructed and the meanings attached to the stars and other symbols in 
the margin, as this will help you to find your way through the text. 


FM 21.0 




Structural Diagram 


[Diagram: Probability and Statistics III, comprising 21.1 Random Variables
and Distributions, 21.2 Sampling, and 21.3 The Binomial Distribution.]


Glossary 


Terms which are defined in this glossary are printed in CAPITALS.


BINOMIAL DISTRIBUTION
The BINOMIAL DISTRIBUTION is the PROBABILITY DISTRIBUTION in which

$$P(X = r) = \binom{n}{r} p^r (1 - p)^{n - r} \qquad (r = 0, 1, \ldots, n),$$

where $n$ and $p$ are PARAMETERS.


ESTIMATE OF A PARAMETER
An ESTIMATE OF A PARAMETER is a REALIZATION of a SAMPLING STATISTIC used
to estimate the value of the PARAMETER in question.


ESTIMATOR OF A PARAMETER
An ESTIMATOR OF A PARAMETER is a SAMPLING STATISTIC whose SAMPLING
DISTRIBUTION is centred on (and closely grouped about) the PARAMETER
whose value has to be estimated.


EXPECTATION
Given the PROBABILITY DISTRIBUTION $P(X = a_r) = p_r$ $(r = 1, 2, \ldots, n)$
and a function $g$ with domain $\{a_1, a_2, \ldots, a_n\}$, the EXPECTATION
of $g(X)$ is

$$\sum_{r=1}^{n} g(a_r)\, p_r.$$


MEAN OF A PROBABILITY DISTRIBUTION (OR OF A RANDOM VARIABLE)
The MEAN OF A PROBABILITY DISTRIBUTION is

$$\sum_{r=1}^{n} a_r\, p_r,$$

where the sample space is $\{a_1, a_2, \ldots, a_n\}$ and $a_r$ has the
associated probability $p_r$ $(r = 1, 2, \ldots, n)$.


PARAMETER OF A PROBABILITY DISTRIBUTION
A PARAMETER OF A PROBABILITY DISTRIBUTION is a mathematical parameter
(i.e. an arbitrary constant which distinguishes various specific cases)
occurring in the expression of the probability distribution.


PROBABILITY DISTRIBUTION OF A RANDOM VARIABLE X
A PROBABILITY DISTRIBUTION OF A RANDOM VARIABLE, $X$, is a function

$$a_r \longmapsto p_r \qquad (r = 1, 2, \ldots, n).$$

The domain of the function is a numerical sample space
$\{a_1, a_2, \ldots, a_n\}$, where the probability associated with $a_r$ is
$p_r$ $(r = 1, 2, \ldots, n)$.

(By the laws of probability, we have $0 \le p_r \le 1$
$(r = 1, 2, \ldots, n)$, and

$$\sum_{r=1}^{n} p_r = 1.)$$

We write

$$P(X = a_r) = p_r \qquad (r = 1, 2, \ldots, n).$$


RANDOM SAMPLE FROM A DISTRIBUTION
A RANDOM SAMPLE FROM A DISTRIBUTION is a sequence of REALIZATIONS of a
RANDOM VARIABLE, obtained from a sequence of independent trials.


RANDOM VARIABLE
A RANDOM VARIABLE is any numerical quantity whose value is determined by
the outcome of a trial where the outcome is not predictable.


REALIZATION OF A RANDOM VARIABLE
A REALIZATION OF A RANDOM VARIABLE, $X$, is a numerical value taken by $X$
in a particular trial.


SAMPLING DISTRIBUTION
A SAMPLING DISTRIBUTION is the PROBABILITY DISTRIBUTION of a SAMPLING
STATISTIC.


SAMPLING STATISTIC
A SAMPLING STATISTIC is a RANDOM VARIABLE based on a RANDOM SAMPLE rather
than on an outcome of an individual trial.


STANDARD DEVIATION OF A DISTRIBUTION (OR OF A RANDOM VARIABLE)
The STANDARD DEVIATION OF A DISTRIBUTION is the positive square root of the
VARIANCE OF THE DISTRIBUTION (OR OF THE RANDOM VARIABLE).


VARIANCE OF A DISTRIBUTION (OR OF A RANDOM VARIABLE)
The VARIANCE OF A DISTRIBUTION (OR OF A RANDOM VARIABLE) is the EXPECTATION
of $g(X)$ where

$$g : X \longmapsto (X - \mu)^2$$

and $\mu$ is the MEAN OF THE DISTRIBUTION (OR OF THE RANDOM VARIABLE).




Notation 

The symbols are presented in the order in which they appear in the text. 

$X$            The usual notation for a random variable.

$P(X = a_r)$   The probability that the random variable $X$ takes the value
               $a_r$.

$E(X)$         The expectation of $X$.

$\mu$          The mean of a distribution (or of a random variable).

$E(g(X))$      The expectation of $g(X)$.

var$(X)$       The variance of a distribution (or of a random variable).

$\sigma$       The standard deviation of a distribution (or of a random
               variable).

$P(X > Y)$     The probability that the random variable $X$ takes a value
               greater than that of the random variable $Y$.


Bibliography 


This correspondence text tries to tell you as much as possible about 
statistics while covering a minimum of the technical detail. No published 
book is written in quite this way, so it is difficult to make suggestions. 


However, 


F. Mosteller, R. E. K. Rourke and G. B. Thomas, Probability with Statistical 
Applications 2nd edition (Addison-Wesley 1961) 


and 


S. N. Collings, Theoretical Statistics: Basic Ideas (Macdonald, to be 
published in late 1971) 


will both help on particular topics such as random variables, discrete 
distributions, expectations, means and variances, and the binomial 
distribution. Both books carry the subject beyond the scope of Unit 21.
Mosteller goes a long way beyond, introducing continuous probability
distributions in the process; Collings gives more attention to the sections
which are relevant to the unit. 




21.0 INTRODUCTION 


In Unit 16, Probability and Statistics I, we considered the nature and
manipulation of simple data, and in Unit 18, Probability and Statistics II,
we considered probability and its rules of operation. This unit is concerned
with statistics: the application of probability theory to physical situations
with a view to the interpretation of the data and the drawing of inferences.
Such inferences are drawn by means of standard procedures on which most,
although not all, statisticians are agreed. Indeed, one way of thinking about
statistical theory is that it is the body of such standard procedures. By 
analogy with this, integration theory would be simply the body of standard 
integrals. But just as there is more to integration theory than the mechanical 
evaluation of integrals, so there is more to statistical theory than the rote 
churning out of inferences. Without some understanding of the basic 
difficulties, the advantages and disadvantages of the various procedures 
and the inherent assumptions, it is impossible to exercise any discretionary 
choice between possible procedures or to know to what extent apparent 
conclusions need to be qualified. Such understandings form an important 
part of statistical theory — a part which is seldom the subject matter of a 
textbook. Most statistical books concentrate on the standard procedures 
although, admittedly, in the end they convey something of the background 
problems. We shall not have time to give an adequate coverage of standard 
procedures in this unit, so we shall not attempt to do so; in any case, for 
positive reasons, we prefer to go into the background problems. In the 
course of this we shall mention some standard procedures, but only to 
illustrate points in the argument. 


To reinforce the distinction between standard procedures and background 
understanding, let us take an analogy from the clothing industry. A 
salesman who hands down a ready-made suit to his client must know his 
job first; but he clearly knows far less about clothes than a tailor who is 
capable of making a suit to measure. A tailor will know at least as much 
as a salesman about what is the most suitable cloth for the purpose; in 
addition, he will know how the seams hold together, and he will be able 
to carry out those adjustments which make all the difference between a 
good fit and a poor fit. 


Our objective is not to present an exhaustive thesis on background 
problems, but to show that there are background problems. In a sense, 
therefore, we are raising more problems than we are solving. In this, what 
we are doing is akin to philosophy: we are identifying the logical and 
mathematical problems raised in arguing from particular data to general 
properties. As in many areas of philosophy, the solutions to these problems 
are often controversial. Because we are discussing general principles, we
shall not make a point of varying the physical settings; on the contrary, 
we shall stay on familiar ground. We shall mention dice, and dice again; 
but each time there will be something different, some new mental attitude 
to assimilate. 


We begin by looking again at mathematical models, or rather, in this case, 
probability models. We have already met the urn model (Unit 18), in which
drawing balls randomly from an urn is equivalent to selecting people 
randomly, or throwing a fair die. Now, given balls in an urn, and given a 
random selection procedure as discussed in Unit 18, we can work out the 
probability of any outcome or combination of outcomes; that is, we can 
apply probability theory. In practice, however, we have people; we have 
dice; we have processes for selecting the people and for throwing the dice. 
But we cannot know with certainty that the physical operations we carry 
out on physical objects are accurately represented by the urn model. 
To pursue this further, let us restrict ourselves to the case of a single die. 


A die is said to be fair if and only if, for a single throw, the sample space is
{1, 2, 3, 4, 5, 6}, where


(i) the probability of each elementary event is $\frac{1}{6}$;
(ii) the elementary events are statistically independent.


In the case of a fair die, the urn model with six numbered balls, which are
replaced after selection, faithfully represents the situation, and there is no
problem. But is the particular die fair? In other words, is the model
accurate for this die? The only way to find out is to experiment and see,
and the first thing we realize is that it is in fact quite difficult to detect bias.
Suppose that instead of being fair the die were biassed towards the 6 in
such a way that the probability of obtaining a 6 was $\frac{1}{4}$ and of obtaining
each of the other numbers was $\frac{3}{20}$. This is a fairly substantial bias towards
the 6, representing an increase of 50 per cent in the probability of a 6
occurring. Yet to obtain sufficient data to discriminate between a fair die
and one with this bias, with only a small risk of making the wrong decision,
would require the die to be thrown several hundred times.
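
To get a feel for "several hundred times", the following sketch computes exact error probabilities for one simple decision rule. It relies on the binomial distribution, which is derived only later in this unit, and the rule itself (call the die biased when the number of sixes exceeds the midpoint between n/6 and n/4) is an assumption made purely for illustration; the function names are likewise hypothetical.

```python
from math import comb


def binomial_pmf(n, p, r):
    """Probability of exactly r sixes in n independent throws."""
    return comb(n, r) * p**r * (1 - p) ** (n - r)


def error_probabilities(n, p_fair=1 / 6, p_biased=1 / 4):
    """Error rates of an illustrative rule: call the die biased when the
    number of sixes exceeds the midpoint between n*p_fair and n*p_biased."""
    threshold = n * (p_fair + p_biased) / 2
    counts = range(n + 1)
    # A fair die wrongly declared biased.
    false_alarm = sum(binomial_pmf(n, p_fair, r) for r in counts if r > threshold)
    # A biased die wrongly declared fair.
    missed_bias = sum(binomial_pmf(n, p_biased, r) for r in counts if r <= threshold)
    return false_alarm, missed_bias


for n in (50, 100, 300, 600):
    false_alarm, missed_bias = error_probabilities(n)
    print(f"n = {n:3d}: P(fair called biased) = {false_alarm:.3f}, "
          f"P(biased called fair) = {missed_bias:.3f}")
```

With only 50 throws both kinds of error remain common under this rule; they become small only when the number of throws reaches the hundreds.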


In most situations where the statistician is attempting to interpret data, 
a validation of the model is not possible. This is why in most situations a 
statistician has no alternative but to use his judgment and experience to 
construct a model, and his interpretation of data will always contain the 
explicit or implicit statement “if the model is correct”. Arguments about 
the interpretation of statistics are thus almost invariably not about the 
computations (which may easily be seen to be right or wrong) but about 
whether the model on which the interpretation is based is valid or not. 
Because statistical investigations often deal with questions which give 
rise to strong emotions (for example, the long drawn out controversy 
about whether smoking causes lung cancer), the real nature of the argument 
is commonly lost. Also, because the validation of a model, even when 
sufficient information is available, is usually a sophisticated statistical 
exercise, the statistician looks at the implications which simple models have 
for the interpretation of data; this is the attitude adopted in this unit. 
In practical situations the experienced statistician may well have
reservations about his interpretation because of the limitations of his model.
This does not necessarily prevent him from using that model, which may 
be the only way of making any interpretation. 


To see what is meant by this last remark, suppose we doubt the accuracy
of the urn model for representing a single die because the event {6} appears
to have probability greater than $\frac{1}{6}$. If we do not know the true probability,
we have to estimate it. We might throw the die 1 000 times and get a 6 on
254 occasions. In this case we would accept $\frac{1}{4}$ as the probability of getting
a 6, and you might think we had made no questionable assumptions.
We have not taken any model of the situation; we have simply carried out
an experiment with a completely open mind, and drawn our conclusions
from the data. But despite first appearances, we have made assumptions!
We have assumed (for instance) that 


(i) the probability of a 6 is constant; that is, independent of the number 
of earlier throws of the die; 


(ii) the event {6} is statistically independent of the other elementary 
events; that is, the probability of a 6 is independent of the results of 
earlier throws. 


The point is that these are assumptions, and they could be invalid: 
(i) could be false if the die suffered appreciably from wear, and (ii) could be 
false if the table were slightly damp so that the face last in contact with the 
surface was rather stickier than the other faces. Yet unless we make these 
kinds of assumption, we can scarcely begin to analyse the data and draw 
any conclusions. 




Thus we cannot avoid making assumptions and constructing models. 
That is why one has to acquire a reasonable “feel” for the subject, to learn 
by experience which models are acceptable, and which are not, in any 
given situation. 


Let us, however, turn from these general considerations to particular 
situations and ideas. 




21.1 RANDOM VARIABLES AND DISTRIBUTIONS 
21.1.0 Introduction 


In Unit 18, Probability and Statistics II, we saw that the idea of a sample
space whose elements (points) represent all the possible outcomes of a
trial is fundamental to a study of probability and statistics. However, in
probability and statistics, as in other branches of mathematics, the simple
ideas often only enable us to deal with simple problems. To do more than
this requires more powerful mathematical tools, which may seem at first
sight to take us further away from the physical problems with which they
are concerned. Before beginning to develop these more powerful methods,
it is worth while looking at a particular example to see why the simple
probability concepts are unlikely to be adequate to cope with real
problems.


21.1.1 The Choice of a Sample Space 


Suppose a population of 20 000 000 voters contains 9 000 000 (45 per cent)
who, if they vote, support the Conservative Party, 9 000 000 (also 45 per
cent) who support the Labour Party, 1 000 000 (5 per cent) who support
the Liberal Party and 1 000 000 (5 per cent) who support other parties.
Even without any more than an intuitive idea of the nature of probability,
most people would be happy with the statement that the probability of a
“man in the street” being a supporter of the Conservative Party is 0.45.
As outlined in Unit 18, we could imagine a sample space of 20 000 000
points all with equal probability, each point representing a particular
voter. The event “a man in the street supports the Conservative Party”
then consists of all the 9 000 000 points representing Conservative
supporters, and the probability of this event is given by
$(9 \times 10^6)/(20 \times 10^6) = 0.45$. The advantage of making this explicit is that it lays bare
the assumption implied by the phrase “a man in the street”. We do not 
mean by this phrase a voter approached, say, by a television interviewer 
at a street corner, but a voter selected in some scientific way, so that all 
voters are equally likely to be chosen. Assuming that the selection of the 
“man in the street” has been made in this scientific manner, the way in 
which the sample space is defined is not unique. If our only purpose is to 
record the probabilities of selecting the four different kinds of voter, it 
is sufficient to consider a sample space containing only four points labelled 
Conservative, Labour, Liberal and Others, with associated probabilities 
0.45, 0.45, 0.05 and 0.05 respectively. 


[Figure: Sample Space 1 (20 000 000 points) mapped to Sample Space 2
(4 points).]


Thus, points in the first space are mapped to points in the second, and any 
alternative descriptions of the results of a particular experiment will be 
connected with the basic sample space in a similar way. For example, 
if we were interested only in Conservatives and non-Conservatives, a 
sample space of two points would be sufficient. The sample space
appropriate to a particular trial depends not only on the trial but also on its
purpose. These ideas of possible alternative sample spaces, and of the
sample space which is appropriate to a particular situation, are
developments of the subject beyond the elementary definition given in Unit 18.
There, a sample space was simply the set of all possible outcomes.


The position becomes more complicated if we want to predict the result
of an election in which only three-quarters of the eligible voters participate.
Intuitive ideas of probability are insufficient to give any meaningful
prediction, even if it is assumed by generalizing from “the man in the
street” that all possible groups of 15 000 000 voters are equally likely to
exercise their right to vote. If all voters were treated as individuals and
each sample point from this experiment represented a different selection of
15 000 000 voters, the number of points in the resulting sample space
would be of the order of $10^{5\,000\,000}$. Clearly such a space needs to be
contracted, or at least simplified in some way, for it to become intelligible.
The sample space which is likely to lead to a manageable description of the 
probability situation is not immediately obvious, and in any case it is not 
unique. The point to be clear about is that it is usually not possible to 
separate the construction of a sample space from the purpose to which it 
is to be put. 
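
The size of this sample space can be checked roughly by counting the ways of choosing 15 000 000 voters from 20 000 000. The sketch below works with logarithms, since the number itself is far too large to write out; the helper name `log10_binomial` is an assumption introduced here for illustration.

```python
from math import lgamma, log


def log10_binomial(n, k):
    """log10 of the number of ways of choosing k items from n,
    computed through the log-gamma function to avoid huge integers."""
    return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)) / log(10)


exponent = log10_binomial(20_000_000, 15_000_000)
print(f"Number of possible groups of voters is about 10^{exponent:,.0f}")
```

The exponent comes out a little under five million, so the number of sample points has nearly five million digits.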


As another example, a large aircraft might have two tyres on each leg of its 
undercarriage. Suppose that on landing one (or both) of these tyres might 
burst. Failure might be caused either by a manufacturing fault or by 
particularly severe wear due to the condition of the runway. Denoting the 
two tyres by L (Left) and R (Right), the state of a tyre after landing by 
F (Failed) and N (Not failed), and the cause of failure by M (Manufacturing 
fault) and W (Wear), the 9 possible outcomes of an experiment (the aircraft
landing) are


(LFM; RFM)   (LFM; RFW)   (LFM; RN)
(LFW; RFM)   (LFW; RFW)   (LFW; RN)
(LN; RFM)    (LN; RFW)    (LN; RN)

However, for the pilot of the aircraft, a sufficient division of the possible
results might be a sample space containing the three points:
(i) neither tyre bursts;
(ii) exactly one tyre bursts;
(iii) both tyres burst.
On the other hand, to the tyre manufacturer, a two point sample space 
might be sufficient, consisting of 
(i) no manufacturing faults; 
(ii) one or more manufacturing faults. 
(In the second case he might be liable for damages.) 
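
The way the nine basic outcomes collapse onto the pilot's and the manufacturer's sample spaces can be sketched as follows. The labels and function names are assumptions chosen for illustration, and the counts printed are numbers of basic outcomes, not probabilities, since nothing in the text says the nine outcomes are equally likely.

```python
from collections import Counter
from itertools import product

# State of one tyre after landing: failed through a manufacturing fault (FM),
# failed through wear (FW), or not failed (N).
TYRE_STATES = ("FM", "FW", "N")

# The basic sample space: ordered pairs (left tyre, right tyre).
outcomes = list(product(TYRE_STATES, repeat=2))


def pilot_view(outcome):
    """Number of burst tyres: 0, 1 or 2."""
    return sum(state != "N" for state in outcome)


def manufacturer_view(outcome):
    """True if at least one tyre failed through a manufacturing fault."""
    return "FM" in outcome


# Count how many basic outcomes map to each point of the smaller spaces.
pilot_space = Counter(pilot_view(o) for o in outcomes)
manufacturer_space = Counter(manufacturer_view(o) for o in outcomes)

print(len(outcomes))
print(dict(pilot_space))
print(dict(manufacturer_space))
```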


To see the sort of thinking needed, try the following exercise. 


Exercise 1 (4 minutes)


When a lift in a block of flats with nine floors is stationary, its doors may be
open or closed.

(i) If, at a particular time, the lift is known to be stationary at one of the
floors, how many points are there in the sample space which defines
the exact state of the lift? (The “state” of the lift means the floor which
it is at, and whether the doors are open or closed.)

(ii) If the lift is stationary at some floor and not working, the maintenance
man has to carry his tools upstairs from the ground floor to the floor
on which the lift is stuck. What sample space is sufficient for his
purpose?

(iii) If it takes one second for the lift doors to open or close and one second
for the lift to travel between any two adjacent floors, what is the
sample space which would suitably describe the state of the lift for a
man who presses the CALL button on the middle floor (in the sense of
how many seconds it will be before he can step into it)? (Assume that
no one else calls the lift.)


(HINT: Find the time corresponding to each of the possible states.)


These considerations become more important when probability is not 
used simply as a descriptive model for a physical situation, but is the basis 
for statistical methods. To return to the 20 000 000 potential voters: for
the statistician, the picture of voting behaviour described so far is merely a 
preliminary discussion of the real problem. 


Although it is not the purpose of this unit to go deeply into the solution of 
problems of voting behaviour, it is worth looking at the real nature of the 
problem, to see why the simple concept of the sample space with its 
associated probabilities is insufficient to deal with what the statistician 
has to do. 


The statistician will not, as we previously assumed, know the true
proportions of different types of voters in the population. If he is both lucky and
industrious, the sort of information available will be:

(i) the total size of the voting population by constituencies (from electoral
rolls), together with voting patterns at previous elections;

(ii) the way in which a few thousand people say they will vote (from public
opinion polls);

(iii) the number of people who are likely to vote and how these people are
distributed between the parties (from past experience and opinion
polls).

The information in each of the three categories will contain uncertainties 
which may or may not be expressible in terms of probabilities. To try to 
describe the overall situation in terms of a single sample space would, 
even if it were possible, result in a description so complicated as to be 
useless for predictive purposes. The need to contract this kind of sample 
space into a manageable pattern is behind the idea of a random variable, 
which we shall now discuss. 




21.1.2 Random Variables


Before we give a definition of a random variable, it is worth looking at 
another example which illustrates in a simple way one of the main 
purposes of our definition. 


Example 1


Suppose we have an ordinary die whose faces are numbered from 1 to 6.
If this die is thrown twice and the number on the uppermost face is recorded
for the two throws (as an ordered pair), a sample space containing 36
points will represent the set of possible results. Thus the sample space may
be represented as follows:


(1,1) (1,2) (1,3) (1,4) (1,5) (1,6)
(2,1) (2,2) (2,3) (2,4) (2,5) (2,6)
(3,1) (3,2) (3,3) (3,4) (3,5) (3,6)
(4,1) (4,2) (4,3) (4,4) (4,5) (4,6)
(5,1) (5,2) (5,3) (5,4) (5,5) (5,6)
(6,1) (6,2) (6,3) (6,4) (6,5) (6,6)


If it is assumed that the die is fair, then each of these 36 points has
probability $\frac{1}{36}$.


Suppose that we are interested not in the individual results, but in the
possible total scores obtained from the two throws of this die.


[Figure: the array above, with diagonal lines joining the pairs which give
the same total score.]


The total scores run from 2 to 12, and the 36 point sample space may (as 
indicated by the diagonal lines in the array above) be mapped to a new 
space whose points are the eleven numbers: 


2,3,4,5, 6, 7,8, 9, 10, 11, 12. 


Even if we use the model of the fair die, the points in this new space will no
longer have equal probability. A total of 2 can be obtained in only one
way, as (1, 1), and hence has probability $\frac{1}{36}$. On the other hand, a total of
7 may be obtained in six ways, as (1, 6), (2, 5), (3, 4), (4, 3), (5, 2) and (6, 1),
and so the probability of obtaining a total of 7 is the sum of the probabilities
of these separate results, which is $\frac{6}{36} = \frac{1}{6}$.
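
The mapping from the 36 ordered pairs to the eleven totals can be carried out mechanically. The following is a minimal sketch under the fair-die model of the example; the variable names are illustrative only, and the probabilities are printed in lowest terms.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# The 36 equally likely ordered pairs for two throws of a fair die.
pairs = list(product(range(1, 7), repeat=2))

# Map each pair to its total and count how many pairs give each total.
ways = Counter(u + v for u, v in pairs)

# Each pair has probability 1/36, so P(total = a) = (number of ways) / 36.
total_distribution = {a: Fraction(ways[a], len(pairs)) for a in sorted(ways)}

for a, prob in total_distribution.items():
    print(f"total {a:2d}: {ways[a]} way(s), probability {prob}")
```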


The ideas illustrated in this example lead to the definition of a random 
variable. 


Solution 21.1.1.1


(i) Denoting the floor by a number from 1 to 9, door open by o, and door
closed by c, there are 18 points: (1, o), (1, c), (2, o), (2, c), ..., (9, o), (9, c).

(ii) Since only the floor is relevant, this sample space contains 9 points:
1, 2, ..., 9.

(iii) The 18 points lead to the corresponding times:


State | (1,o) | (2,o) | (3,o) | (4,o) | (5,o) | (6,o) | (7,o) | (8,o) | (9,o)
Time  |   6   |   5   |   4   |   3   |   0   |   3   |   4   |   5   |   6

State | (1,c) | (2,c) | (3,c) | (4,c) | (5,c) | (6,c) | (7,c) | (8,c) | (9,c)
Time  |   5   |   4   |   3   |   2   |   1   |   2   |   3   |   4   |   5


The state of the lift may be described by 7 points representing times of
0, 1, 2, ..., 6 seconds.
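
The times in part (iii) of this solution can be reproduced by the short sketch below. It assumes, as the tabulated times imply, that an open door must close (one second) before the lift can move, and that the door takes one second to open on arrival; the function and variable names are illustrative.

```python
from collections import defaultdict

MIDDLE_FLOOR = 5  # the caller stands on the middle one of the nine floors


def waiting_time(floor, door):
    """Seconds until the caller can step in, for a lift at `floor` with
    `door` equal to "o" (open) or "c" (closed)."""
    if floor == MIDDLE_FLOOR:
        return 0 if door == "o" else 1   # at most the door has to open
    travel = abs(floor - MIDDLE_FLOOR)    # one second per floor travelled
    closing = 1 if door == "o" else 0     # an open door must close first
    return closing + travel + 1           # plus one second to open on arrival


# Group the 18 states of the lift by the waiting time they lead to.
states_by_time = defaultdict(list)
for floor in range(1, 10):
    for door in ("o", "c"):
        states_by_time[waiting_time(floor, door)].append((floor, door))

for seconds in sorted(states_by_time):
    print(seconds, states_by_time[seconds])
```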


Consider a trial whose possible outcomes are numbers $a_1, a_2, \ldots, a_n$. We
denote by X a general element from this set of numbers, and X is called a 
random variable. Note that we define a random variable as a general 
element from a sample space of numbers. That is not to say that all other 
cases are “beyond the pale”. There may be some reason for mapping 
elements of a non-numerical sample space to numerical values. For 
example, in a particular card game we may not be interested in the suit 
of a card, and so we can map the sample space of 52 cards to the set 
{1,2,3,..., 13}; we could then think of a random variable taking values 
from this new sample space. 


Suppose we have thrown a die three times, and are about to throw it a 
fourth time. The values on the first three occasions are 2, 5, 4, and on the 
fourth occasion the value is as yet unknown. We can represent this 
unknown value by the random variable X which may take any value in 
the set {1, 2, 3,4, 5, 6}. 


If the fourth throw gives the value 3, then we say that 3 is the realization of
the random variable on this occasion (or that X takes the value 3). In the
same way we would say that X had realizations (or took the values) 2, 5, 4
on the first three throws. All we have to keep clear is the distinction between

the random variable X, when X stands for an unknown value,

and

the realization of X, which is a specific number.


Before the fourth throw, we are interested in the values X may take on
this occasion; but we are also interested in the probabilities with which X
may take these values. Thus, for a fair die, X may take the value 1, and the
probability that it does so is $\frac{1}{6}$. In a more general situation, the set of
possible outcomes of a trial may be the set of numbers $\{a_1, a_2, \ldots, a_n\}$.
If the outcome $a_r$ has probability $p_r$, and X stands for the outcome of a
trial, then we denote the probability that X has the realization $a_r$ by
$P(X = a_r)$; that is, we write


$$P(X = a_r) = p_r.$$




As we have mentioned, sometimes the outcomes of trials which are not
themselves numbers have numbers associated with them; in other words,
there is a mapping (a function, in fact) from the sample space to the real
numbers R. In this case it is as if the associated real numbers were the
outcomes of the trials, so that these numbers can be represented by a
random variable. One has to take a little care when this mapping is not
one-one, for, in general, if $E_1, E_2, \ldots, E_k$ are all the outcomes mapping to
$a$ ($a \in R$), and X is the random variable standing for the number
corresponding to any trial, then


$$P(X = a) = P(E_1) + P(E_2) + \cdots + P(E_k).$$


For example, if the trial consists of drawing a card from a pack, and the
aces and court cards are assigned the value 10 whilst every other card is
assigned its face value, then


$$P(X = 10) = P(\{\text{Ace}\}) + P(\{\text{King}\}) + P(\{\text{Queen}\}) + P(\{\text{Knave}\}) + P(\{\text{Ten}\}) = \tfrac{5}{13}.$$

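
The card example can be checked by listing the pack and adding up the probabilities of all cards that map to a given value. This is a minimal sketch assuming a standard 52-card pack with each card equally likely to be drawn; the names `card_value` and `probability_of_value` are hypothetical.

```python
from fractions import Fraction

RANKS = ["Ace", "2", "3", "4", "5", "6", "7", "8", "9", "10",
         "Knave", "Queen", "King"]
SUITS = ["Clubs", "Diamonds", "Hearts", "Spades"]


def card_value(rank):
    """Aces and court cards count 10; every other card counts its face value."""
    return 10 if rank in ("Ace", "Knave", "Queen", "King") else int(rank)


deck = [(rank, suit) for rank in RANKS for suit in SUITS]


def probability_of_value(a):
    """P(X = a): add the probabilities (1/52 each) of all cards mapped to a."""
    return sum(Fraction(1, len(deck)) for rank, _ in deck if card_value(rank) == a)


print(probability_of_value(10))  # five ranks of four cards each map to 10
print(probability_of_value(7))   # only the four sevens map to 7
```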

Another type of situation in which the outcomes are not single numbers
arises when the outcomes are pairs of numbers. For example, if a trial
consists of throwing a die twice, the outcomes are ordered pairs of integers
$(u, v)$, where $1 \le u \le 6$ and $1 \le v \le 6$. If we are interested simply in the
total score, the obvious* thing to do is to map $(u, v)$ to $a = u + v$, and then
to take the random variable X having as its realizations all the possible
values of $a$.


Let us see how this works out for a fair die. Only (1, 1) maps to 2. Only
(1, 2), (2, 1) map to 3. Only (1, 3), (2, 2), (3, 1) map to 4, etc. In our model for
fair dice, each outcome has probability $\frac{1}{36}$, so that we have:


a        |  2   |  3   |  4   |  5   |  6   |  7   |  8   |  9   |  10  |  11  |  12
P(X = a) | 1/36 | 2/36 | 3/36 | 4/36 | 5/36 | 6/36 | 5/36 | 4/36 | 3/36 | 2/36 | 1/36


Exercise 1 


Fora fair die which has been thrown twice, give the set, S,of possible values 

for the appropriate random variable, together with the probabilities 

associated with these values, if we are interested in: 

(i) the larger of the two scores (or the common score if the two are equal) ; 

(ii) the difference between the first score and the second score: 
(first score) — (second score). a 


Exercise 2 


In Exercise 21.1.1.1, the time between a man pressing the CALL button on 
the middle floor and his stepping into the lift is a random variable. If the 
lift is equally likely to be at any of the nine floors, and the door is always 
twice as likely to be closed as open, determine the probability of each of the 
possible values of the random variable.


* “Do not imagine that mathematics is hard and crabbed, and repulsive to common sense.
It is merely the etherealization of common sense.”
W. Thomson (Lord Kelvin) 
S. P. Thompson, Life of Lord Kelvin 
(London, 1910) 




Exercise 1 
(3 minutes) 


Exercise 2 
(4 minutes) 


Solution 1 


From the table given on page 7, we have: 


(i) If X is the larger of the two scores, then S = {1, 2, 3, 4, 5, 6}, and

P(X = 1) = 1/36,  P(X = 2) = 3/36,
P(X = 3) = 5/36,  P(X = 4) = 7/36,
P(X = 5) = 9/36,  P(X = 6) = 11/36.

(ii) If X is the difference between the first and the second scores, then
S = {−5, 5, −4, 4, −3, 3, −2, 2, −1, 1, 0}, and

P(X = −5) = P(X = 5) = 1/36,
P(X = −4) = P(X = 4) = 2/36,
P(X = −3) = P(X = 3) = 3/36,
P(X = −2) = P(X = 2) = 4/36,
P(X = −1) = P(X = 1) = 5/36,
P(X = 0) = 6/36.

Solution 2 


The probability that the lift is on any floor is 1/9, and, since the door is twice
as likely to be closed as open, the probability that it is open on any floor is
1/3, and the probability that it is closed is 2/3. Hence, using the results
tabulated in Solution 21.1.1.1(iii), together with these probabilities, we
have:


[Table of Time against Probability, giving P(X = 1), ..., P(X = 5); the entries are illegible in the source.]




21.1.3. Probability Distributions 


Given a trial with associated numerical outcomes $a_1, a_2, \ldots, a_n$, we have
introduced the random variable $X$ and written

$$P(X = a_r) = p_r \quad (r = 1, 2, \ldots, n)$$

where $p_r$ is the probability of the outcome $a_r$. Once we know the a's and
the p's for any trial, we know how the total probability, namely 1, is
distributed over the outcomes $a_1, a_2, \ldots, a_n$. In more formal terms, when
we have the a's and the p's, we can define a function which maps $a_r$ to its
probability $p_r$; the function has domain the set $\{a_1, a_2, \ldots, a_n\}$ and
codomain the set of real numbers $R$. This function is called the probability
distribution of the random variable $X$. (In fact, it would be better to
regard both the set $\{a_1, \ldots, a_n\}$ and the set of images $\{p_1, p_2, \ldots, p_n\}$ as
sequences rather than sets, because we are interested in the distribution of
the p's, so that each of the p's, whether repeated or not, is of interest.)


The graph of such a probability distribution can be represented by a bar 
chart; for example: 


[Figure: bar chart of a probability distribution.]


Starting with a trial having associated numerical outcomes, we are led to
consider a random variable $X$, and then the probability distribution of
$X$, which is specified in terms of numbers $a_r$ and $p_r$. Conversely, if we take
any values $a_r$ $(r = 1, 2, \ldots, n)$ and values $p_r$ satisfying

$$0 \le p_r \le 1$$

and

$$\sum_{r=1}^{n} p_r = 1,$$

there is a conceivable random variable $X$ satisfying

$$P(X = a_r) = p_r.$$


Given any trial with associated numerical outcomes, the first step is usually 
to work out the probability distribution of the random variable X standing 
for the outcome; this is either because we simply want the probabilities of 
the individual outcomes, or because the probability distribution is a half- 
way stage towards finding something more complicated. 


Exercise 1


In a road safety competition the promoters list ten motor accessories 
(e.g. seat belts, fog lamps) which contribute to road safety. Competitors are 
asked to select the three which are most effective, and this selection is also 
made by a panel of experts. A competitor chooses his three items by select- 
ing three balls from an urn containing ten balls labelled with the names of 
the accessories. If X is the number of accessories which are common to 
this competitor’s choice and that of the experts, find the probability 
distribution of X. 


(HINTS: What is the numerical sample space? How many selections of
3 accessories can be made out of 10? How many of these have 0, 1, 2, 3
accessories from the experts' list?)




Exercise 1 
(4 minutes) 


Solution 1 


As suggested in the last question in the hints, the numerical sample space
is {0, 1, 2, 3}. In this exercise we can use as our model an urn containing
three white and seven black balls, the three white balls representing the
three chosen accessories. Three balls are chosen from this urn in such a way
that all possible selections are equally likely. If, for convenience, we regard
the balls as distinguishable, we have

$$W_1, W_2, W_3, B_1, B_2, B_3, B_4, B_5, B_6, B_7.$$

The number of selections of 3 balls from these 10 is the number of combinations
of 3 objects from 10 objects; that is, $\binom{10}{3} = 120$, all of which are
equally likely. The number of such selections containing:

(i) no white balls = number of ways of selecting 3 balls from the 7 black;
that is, $\binom{7}{3} = 35$;

(ii) 1 white ball = number of ways of selecting 1 ball from the 3 white
and 2 from the 7 black; that is,

$$\binom{3}{1}\binom{7}{2} = 3 \times 21 = 63;$$

(iii) 2 white balls = number of ways of selecting 2 balls from the 3 white
and 1 ball from the 7 black; that is,

$$\binom{3}{2}\binom{7}{1} = 3 \times 7 = 21;$$

(iv) 3 white balls is clearly just 1.

Therefore, we have

$$P(X = 0) = \tfrac{35}{120} = \tfrac{7}{24}$$
$$P(X = 1) = \tfrac{63}{120} = \tfrac{21}{40}$$
$$P(X = 2) = \tfrac{21}{120} = \tfrac{7}{40}$$
$$P(X = 3) = \tfrac{1}{120}$$

and so the probability distribution is the function:

0 ↦ 7/24
1 ↦ 21/40
2 ↦ 7/40
3 ↦ 1/120
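The counting argument of this solution can be checked with a short calculation. The sketch below is in Python and the function name matches_distribution is hypothetical; it assumes, as in the urn model above, that every selection of three balls is equally likely.

```python
from fractions import Fraction
from math import comb


def matches_distribution(total=10, white=3, drawn=3):
    """P(X = k), where X is the number of white balls among the drawn ones.

    total -- number of balls in the urn (10 accessories)
    white -- number of white balls (the experts' 3 accessories)
    drawn -- number of balls selected (the competitor's 3 accessories)
    """
    n_selections = comb(total, drawn)        # all selections are equally likely
    return {
        k: Fraction(comb(white, k) * comb(total - white, drawn - k), n_selections)
        for k in range(min(white, drawn) + 1)
    }


if __name__ == "__main__":
    for k, p in matches_distribution().items():
        print(k, p)
```

Changing the three arguments gives the corresponding distribution for other sizes of urn and selection.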




21.1.4 Simple Properties of Probability Distributions 


If a trial is carried out a number of times and the outcome of each trial
recorded, then we shall end up with a list of elements chosen from the 
sample space. If the sample space consists of numbers, then this list will be 
simply a list of numbers. This list of numbers can be regarded as data to 
be analysed in much the same way as we analysed data in Unit 16, 
Probability and Statistics I; we can calculate the mean, variance, and so on. 
By considering these measures when the trial is carried out a large number 
of times, we are led to define measures which are called the mean of the 
distribution, the variance of the distribution and so on. In this section we 
begin to explain these ideas. 


Expectations 


Suppose X is a random variable with associated probability distribution 
defined by: 


$$P(X = a_r) = p_r \quad (r = 1, 2, \ldots, n).$$

If we were to perform a sequence of $N$ trials and find that $X$ takes the value
$a_1$ on $m_1$ occasions, $a_2$ on $m_2$ occasions, etc., then the average of the values
taken by $X$ so far would be

$$\frac{m_1 a_1 + m_2 a_2 + \cdots + m_n a_n}{N} = \frac{m_1}{N} a_1 + \frac{m_2}{N} a_2 + \cdots + \frac{m_n}{N} a_n.$$

Now $m_1/N$ is the relative frequency of occurrence of the value $a_1$, and we
have associated the probability of any event with the relative frequency of
occurrence of that event in the long run. Hence we would expect $m_1/N$ to be
approximately equal to $p_1$. Similarly we would expect $m_2/N$ to be approximately
equal to $p_2$, and so on. Therefore we would expect the average of
the values taken by $X$ to be approximately equal to

$$a_1 p_1 + a_2 p_2 + \cdots + a_n p_n = \sum_{r=1}^{n} a_r p_r.$$

Note that $N$ does not appear in this expression. We call this expression
the expectation of $X$, or sometimes the expected value of $X$; we denote it by
$E(X)$ or by the Greek letter $\mu$ (“mu”). Thus we have:

$$E(X) = \sum_{r=1}^{n} a_r p_r = \mu.$$

$\mu$ is often also called the mean of the distribution. Another way of looking
at $\mu$ is to say that it is the weighted mean of the $a_r$'s, the weight of $a_r$ being
the probability with which it occurs.
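The relation between the long-run average and the expectation can be illustrated by simulation. The following Python sketch is only an illustration: the function names are our own, the trials are simulated with the standard pseudo-random generator, and the fair die is used as the example distribution.

```python
import random
from fractions import Fraction


def expectation(dist):
    """E(X) = sum of a_r * p_r, where dist maps each value a_r to p_r."""
    return sum(a * p for a, p in dist.items())


def average_of_trials(dist, n_trials, seed=0):
    """Average of the values taken by X in n_trials simulated trials."""
    rng = random.Random(seed)               # fixed seed so that runs are repeatable
    values = list(dist)
    weights = [float(dist[a]) for a in values]
    outcomes = rng.choices(values, weights=weights, k=n_trials)
    return sum(outcomes) / n_trials


fair_die = {r: Fraction(1, 6) for r in range(1, 7)}

if __name__ == "__main__":
    print("E(X) =", expectation(fair_die))
    for n_trials in (10, 1000, 100000):
        print(n_trials, average_of_trials(fair_die, n_trials))
```

For small numbers of trials the average may be some way from E(X); it would normally settle closer to E(X) as the number of trials grows.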


Exercise 1


For the road safety competition of Exercise 21.1.3.1, find the expected 
number of accessories on which the competitor and the experts agree. 


Exercise 2 


Using the probability distribution for the number of seconds taken by a 
lift to arrive, already obtained in Exercise 21.1.2.2, determine the expected 
value of the waiting time.




Exercise 1 
(3 minutes) 


Exercise 2 
(2 minutes) 


Solution 1 
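Using the results of Exercise 21.1.3.1,

$$E(X) = \sum_{r=0}^{3} r \times P(X = r) = 0 \times \tfrac{7}{24} + 1 \times \tfrac{21}{40} + 2 \times \tfrac{7}{40} + 3 \times \tfrac{1}{120} = \tfrac{108}{120} = \tfrac{9}{10}$$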


Solution 2 
Using the results of Exercise 21.1.2.2, 
X= H+ ht h+ H+ 8 +H 
=# 


So, on average, one might expect to wait 34 seconds for the lift. a 


Although $E(X)$ is called the expected value of $X$, we do not expect a single
random $X$ to take this value, or even be very close to it. It is rare for a
random $X$ to equal $\mu$, sometimes impossible, as in the last two exercises.
All that we do expect is that with a large number of realizations of $X$ from
the same distribution, their average value will be close to $\mu$. For a fair die,
the expected value of the random variable is

$$\sum_{r=1}^{6} r \times \tfrac{1}{6} = \tfrac{21}{6} = 3.5$$


To know the mean of a distribution is to know something, but it is a long
way from knowing the whole distribution. A die is not necessarily fair
just because the expected value of the score is 3.5. For example, a die
whose score is represented by the random variable $X$, where

$$P(X = 1) = P(X = 6) = \tfrac{1}{4}$$
and
$$P(X = 2) = P(X = 3) = P(X = 4) = P(X = 5) = \tfrac{1}{8}$$

has expected score 3.5, but it is decidedly “loaded”. To achieve a fuller
description of the probability distribution we need some other characteristics,
and we can get these from a wider application of the idea of an
expected value.


We introduced expectations by talking about the expected value of $X$;
but there is no reason why we should not consider the expected value of $X^2$,
or indeed of $g(X)$, where $g$ is some general function with domain
$\{a_1, a_2, \ldots, a_n\}$ and codomain $R$. The expectation of $X$ is by definition
given by

$$E(X) = \sum_{r=1}^{n} a_r p_r.$$

By analogy, the expectation of $g(X)$ is given by

$$E(g(X)) = \sum_{r=1}^{n} g(a_r) p_r,$$

because if $a_r \mapsto g(a_r)$ and $a_r$ occurs with probability $p_r$, then $g(a_r)$ occurs
with probability $p_r$. (We need not worry if $g(a_r) = g(a_s)$ for $r \ne s$, as this is
looked after by adding the corresponding probabilities.)
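The definition of the expectation of g(X) translates directly into a short calculation. In the Python sketch below, the function name expectation_of and the three-valued distribution are hypothetical, chosen only so that two different values of X have the same image under g.

```python
from fractions import Fraction


def expectation_of(g, dist):
    """E(g(X)) = sum of g(a_r) * p_r over all the values a_r of X."""
    return sum(g(a) * p for a, p in dist.items())


# Hypothetical distribution on {-1, 0, 1}; squaring sends both -1 and 1 to 1,
# and the sum simply adds the two corresponding probabilities.
symmetric = {-1: Fraction(1, 4), 0: Fraction(1, 2), 1: Fraction(1, 4)}

if __name__ == "__main__":
    print(expectation_of(lambda a: a, symmetric))       # E(X)
    print(expectation_of(lambda a: a * a, symmetric))   # E(X^2)
```

Nothing special has to be done for repeated images: each value of X contributes its own term to the sum.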




Let us first consider $E(X^2)$. If $X$ is the random variable corresponding to
the score obtained with a single throw of a fair die, the expected value of
$X^2$ is

$$E(X^2) = \sum_{r=1}^{6} r^2 p_r = \sum_{r=1}^{6} r^2 \times \tfrac{1}{6} = \tfrac{91}{6}$$


Clearly, for any random variable $X$ there is no end to the possible functions
$g$, but the functions which have either some intuitive physical interpretation
or some mathematical or statistical utility are relatively few. One such
function enables us to calculate the variance* of a distribution. Consider
$g$, with domain $\{a_1, a_2, \ldots, a_n\}$, where

$$g(X) = (X - \mu)^2, \text{ and } \mu = E(X).$$

The expected value of $(X - \mu)^2$, namely

$$E((X - \mu)^2) = \sum_{r=1}^{n} (a_r - \mu)^2 p_r,$$

is called the variance of $X$, and is denoted by all sorts of things in the
literature: notations currently in use include var(X), $\sigma_X^2$ and $\sigma^2$ ($\sigma$ is the
Greek letter “sigma”).


From a computational point of view, it is often easier to make use of the
result

$$E((X - \mu)^2) = E(X^2) - \mu^2$$

(see the following exercise).
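The two ways of computing a variance can be compared numerically. The Python sketch below is an illustration only; the function names and the three-valued distribution are hypothetical and do not come from the text.

```python
from fractions import Fraction


def mean(dist):
    """mu = E(X), where dist maps each value a_r to its probability p_r."""
    return sum(a * p for a, p in dist.items())


def variance_by_definition(dist):
    """E((X - mu)^2), computed term by term."""
    mu = mean(dist)
    return sum((a - mu) ** 2 * p for a, p in dist.items())


def variance_by_shortcut(dist):
    """E(X^2) - mu^2."""
    mu = mean(dist)
    return sum(a * a * p for a, p in dist.items()) - mu ** 2


# Hypothetical distribution chosen only for illustration.
example = {0: Fraction(1, 2), 1: Fraction(1, 3), 4: Fraction(1, 6)}

if __name__ == "__main__":
    print(mean(example))
    print(variance_by_definition(example), variance_by_shortcut(example))
```

Agreement on one example is of course not a proof; the general result is the subject of the exercise.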


Exercise 3
(i) Prove that

$$E((X - \mu)^2) = E(X^2) - \mu^2$$

(ii) If $X$ is a random variable standing for the outcome of a fair die, find
the variance of $X$.
(iii) If the die is then loaded so that

$$P(X = 1) = P(X = 6) = \tfrac{1}{4}$$
$$P(X = 2) = P(X = 3) = P(X = 4) = P(X = 5) = \tfrac{1}{8},$$

find the expected value ($\mu$) of $X$ and the variance of $X$.

(iv) If $Y = cX + d$ where $c$, $d$ are constants, find an expression for
$E(Y) = \mu_Y$ in terms of $E(X) = \mu_X$.

(v) In (iv), prove that the variance of $Y$ is the product of $c^2$ and the
variance of $X$.


We are now going to examine what interpretation (if any) can be put on
var(X). Suppose, as we did when discussing expected values, that $X$ is a
random variable with probability distribution defined by:

$$P(X = a_r) = p_r \quad (r = 1, 2, \ldots, n)$$

If we again perform a sequence of $N$ trials and find that, of the outcomes,

$m_1$ values equal $a_1$,
$m_2$ values equal $a_2$,
...
$m_n$ values equal $a_n$,

then, as before, we would expect $m_r/N$ to be approximately equal to $p_r$.


* See section 16.1.2 of Unit 16, Probability and Statistics I.




Exercise 3 
(4 minutes) 




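Solution 3

(i)
$$E((X - \mu)^2) = \sum_{r=1}^{n} (a_r - \mu)^2 p_r$$
$$= \sum_{r=1}^{n} (a_r^2 - 2\mu a_r + \mu^2) p_r$$
$$= \sum_{r=1}^{n} a_r^2 p_r - 2\mu \sum_{r=1}^{n} a_r p_r + \mu^2 \sum_{r=1}^{n} p_r$$
$$= \sum_{r=1}^{n} a_r^2 p_r - 2\mu^2 + \mu^2 \quad \left(\text{as } \sum_{r=1}^{n} a_r p_r = \mu \text{ and } \sum_{r=1}^{n} p_r = 1\right)$$
$$= E(X^2) - \mu^2$$

(ii)
$$E((X - \mu)^2) = E(X^2) - \mu^2 = \tfrac{91}{6} - \tfrac{49}{4} = \tfrac{35}{12}$$

As expected for a uniform distribution, the variance of $X$ is just the
variance of the set {1, 2, 3, 4, 5, 6}.

(iii)
$$E(X) = \tfrac{1}{4} + \tfrac{2}{8} + \tfrac{3}{8} + \tfrac{4}{8} + \tfrac{5}{8} + \tfrac{6}{4} = \tfrac{28}{8} = \tfrac{7}{2}$$
$$E(X^2) = \tfrac{1}{4} + \tfrac{4}{8} + \tfrac{9}{8} + \tfrac{16}{8} + \tfrac{25}{8} + \tfrac{36}{4} = 16$$

so that

$$E(X^2) - \mu^2 = 16 - \tfrac{49}{4} = \tfrac{15}{4}$$

(iv)
$$\mu_Y = E(Y) = E(cX + d)$$
$$= \sum_{r=1}^{n} (c a_r + d) p_r \quad \left(\text{for } E(g(X)) = \sum_{r=1}^{n} g(a_r) p_r\right)$$
$$= c \sum_{r=1}^{n} a_r p_r + d \sum_{r=1}^{n} p_r$$
$$= cE(X) + d$$
$$= c\mu_X + d$$

In particular, we notice by putting $d = 0$ that $E(cX) = cE(X)$.

(v)
$$\mathrm{var}(Y) = E((Y - \mu_Y)^2)$$
$$= E((cX + d - c\mu_X - d)^2)$$
$$= E(c^2 (X - \mu_X)^2)$$
$$= c^2\, \mathrm{var}(X)$$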


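For convenience, we denote the outcomes of the $N$ trials by $\{x_1, x_2, \ldots, x_N\}$
and the mean of this set by $\bar{x}$. Then

$$\mu = E(X) = \sum_{r=1}^{n} a_r p_r \approx \sum_{r=1}^{n} a_r \frac{m_r}{N} = \frac{1}{N} \sum_{i=1}^{N} x_i = \bar{x},$$

the mean of the outcomes of the $N$ trials (the $x$ values) considered simply
as data.

Again,

$$\mathrm{var}(X) = \sum_{r=1}^{n} (a_r - \mu)^2 p_r \approx \sum_{r=1}^{n} (a_r - \mu)^2 \frac{m_r}{N} \approx \sum_{r=1}^{n} (a_r - \bar{x})^2 \frac{m_r}{N} \quad \text{(from above)}$$
$$= \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2$$

= variance of the $x$ values considered as data.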


In Unit 16 we saw that the variance of data is used as a measure of spread 
of the data; we would therefore expect the variance of a distribution to 
be a measure of spread of the distribution. “Spread” conveys the idea of 
“width” or “linear extension”, but var(X) is obtained as the sum of 
quadratic expressions. We therefore use $\sigma$ ($= \sqrt{\mathrm{var}(X)}$) rather than
$\sigma^2$ ($= \mathrm{var}(X)$) as a measure of spread, and call it the standard deviation
of the distribution. Thus

$$\sqrt{E((X - \mu)^2)} = \sigma = \text{standard deviation}$$


In Exercise 3 we found that the fair and the loaded dice had equal means 
but different variances — and hence different spreads. But we do not just 
passively observe that different distributions have different spreads; 
the point is of vital importance in the Theory of Estimation which we 
discuss in section 21.2.3. 


We have now constructed some machinery for coping with simplified 
models of probabilistic experiments. This machinery will be used later in 
this unit, as well as in future years. 


Comparing Two Distributions 


In many probability situations we are faced with the problem of choice. 
Do we go in for this method of selection or that? Do we adopt this trial or 
that? Do we use this method or that for estimating an unknown value? 
However we choose, we shall be considering some distribution or other, 
and very often it comes down to choosing between distributions. Which is 
the more suitable for our purposes? Even taking the values a, for granted, 
it still takes n — 1 numbers to specify the distribution (why not n?). Hence 
we could be left trying to compare two distributions, each having n — 1 
measures. Somehow we must find a way of representing each distribution
by a single measure. This is where expectations come in; for the distributions
which we commonly meet have just one mean, so we can decide
between two distributions by comparing their means (if indeed the mean
is a relevant measure in the context). If the mean is not relevant, perhaps
the variance may be relevant; if so, we have a single measure for comparison
purposes.


Example 1


To fix our ideas, suppose we wish to play a simple dice game with an 
opponent, and we are offered a choice between a fair die having a prob- 
ability distribution given by 


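$$P(X = r) = \tfrac{1}{6} \quad (r = 1, 2, \ldots, 6)$$

and a loaded die having a probability distribution given by

$$P(Y = 1) = P(Y = 6) = \tfrac{1}{3}$$
$$P(Y = 2) = \tfrac{1}{6}$$
$$P(Y = 3) = P(Y = 4) = \tfrac{1}{24}$$
$$P(Y = 5) = \tfrac{1}{12}$$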


The two distributions can be illustrated by bar charts: 


[Figure: two bar charts, labelled “Fair die” and “Loaded die”.]


If the object of the game is to throw the higher score, which of the dice 
should we choose? That is, which distribution suits us best? We may think 
on the following lines. 

(i) A 6 is a very good number to get. The probability of throwing a 6 with
the loaded die is twice the probability of throwing a 6 with the fair die,
and therefore it is the better die to have.
However, the probability of throwing a 1 with the loaded die is twice 
that for the fair die, and this cancels out the advantage mentioned 
above. It therefore looks as if we shall have to consider the whole 
probability distribution for the loaded die, not just part of it. 

(ii) If we do consider the whole distribution for the loaded die, it is surely 
relevant to consider the mean score. If the expected score from the 
loaded die is higher than that for the fair die, we should prefer to 
have it. 

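Now for the fair die,

$$E(X) = 3.5$$

For the loaded die,

$$E(Y) = 1 \times \tfrac{1}{3} + 2 \times \tfrac{1}{6} + 3 \times \tfrac{1}{24} + 4 \times \tfrac{1}{24} + 5 \times \tfrac{1}{12} + 6 \times \tfrac{1}{3}$$
$$= \tfrac{1}{24}(8 + 8 + 3 + 4 + 10 + 48)$$
$$= \tfrac{27}{8} = 3.375 < E(X).$$

Therefore it is better to have the fair die.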

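(iii) Argument (ii) seems relevant, but does it settle the matter beyond all
further consideration? The only way to find out is to examine all
possible pairs of outcomes, in which the first and second elements of
the pair represent the outcomes of throwing the fair die and the
loaded die respectively. These are shown below with the probabilities
attached.

        | Y = 1 | Y = 2 | Y = 3 | Y = 4 | Y = 5 | Y = 6
X = 1   | 1/18  | 1/36  | 1/144 | 1/144 | 1/72  | 1/18
X = 2   | 1/18  | 1/36  | 1/144 | 1/144 | 1/72  | 1/18
X = 3   | 1/18  | 1/36  | 1/144 | 1/144 | 1/72  | 1/18
X = 4   | 1/18  | 1/36  | 1/144 | 1/144 | 1/72  | 1/18
X = 5   | 1/18  | 1/36  | 1/144 | 1/144 | 1/72  | 1/18
X = 6   | 1/18  | 1/36  | 1/144 | 1/144 | 1/72  | 1/18

The entries on the diagonal form the region X = Y, those above the diagonal
the region X < Y, and those below the diagonal the region X > Y.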


Because the events “X = r” and “Y = s” are statistically independent,
the probabilities in this table have been obtained by multiplication.
For example,

$$P(X = 1; Y = 1) = P(X = 1) \times P(Y = 1) = \tfrac{1}{6} \times \tfrac{1}{3} = \tfrac{1}{18}$$


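By adding up the probabilities in the three regions of the table, we find
that

$$P(X > Y) = \tfrac{63}{144} = \tfrac{7}{16},$$
$$P(X = Y) = \tfrac{24}{144} = \tfrac{1}{6}, \qquad (1)$$
$$P(X < Y) = \tfrac{57}{144} = \tfrac{19}{48},$$

where, for instance, $P(X > Y)$ is the probability that the realization of
$X$ is greater than the realization of $Y$.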


The sum of these probabilities is 1, reflecting the fact that the three events 
are mutually exclusive and exhaust all the possible outcomes of the trial. 




Equations (1) show that the fair die is “better” than the loaded die in the 
sense that, if they are both thrown once, the fair die is more likely to yield 
the larger score than the smaller. 


This backs up the conclusion reached in (ii), and this might have been
expected; but it is not all that “obvious”. The higher mean does not
necessarily indicate that we are more likely to get the higher score with
the fair die (see the following exercise).
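Adding up the probabilities in the three regions of such a table is a routine calculation, and it can be sketched in Python as follows. The function name compare is our own, and the loaded die used here is an illustrative assumption (faces 1 and 6 with probability 1/4, the other faces with probability 1/8); it is not claimed to be the die of Example 1.

```python
from fractions import Fraction
from itertools import product


def compare(dist_x, dist_y):
    """Return (P(X > Y), P(X = Y), P(X < Y)) for independent X and Y."""
    greater = equal = less = Fraction(0)
    for (x, px), (y, py) in product(dist_x.items(), dist_y.items()):
        joint = px * py                 # independence: the probabilities multiply
        if x > y:
            greater += joint
        elif x == y:
            equal += joint
        else:
            less += joint
    return greater, equal, less


fair = {r: Fraction(1, 6) for r in range(1, 7)}
# Illustrative loaded die (an assumption, not taken from Example 1).
loaded = {1: Fraction(1, 4), 2: Fraction(1, 8), 3: Fraction(1, 8),
          4: Fraction(1, 8), 5: Fraction(1, 8), 6: Fraction(1, 4)}

if __name__ == "__main__":
    print(compare(fair, loaded))
```

For this particular pair of dice the two means are equal and neither die is more likely than the other to show the larger score; other loaded dice can be tried by changing the second distribution.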


Exercise 4 


Let $X$ denote the score obtained from a loaded die, where the probability
distribution of $X$ is given by:

$$P(X = r) = \tfrac{1}{4} \quad (r = 1, 2, 3)$$
and
$$P(X = r) = \tfrac{1}{12} \quad (r = 4, 5, 6).$$

Let $Y$ denote the score obtained from another loaded die with probability
distribution given by:

$$P(Y = r) = \tfrac{73}{150} \quad (r = 2, 3)$$
and
$$P(Y = r) = \tfrac{1}{150} \quad (r = 1, 4, 5, 6).$$

Show that, if each die is thrown once, the first die is less likely to show the
larger score than the second, but that the expectation of score for the first
die is larger than for the second.


Summary 


We have learnt from this section that it is not always immediately possible 
to choose between two distributions, because there may be more than one 
measure relevant to the situation. But if it is quite clear what is meant by 
“better”, it may be possible to find just one characteristic of a distribution 
which can be used as a criterion for the choice. In our Example 1 this single
measure was the distribution mean. In other circumstances, the more
relevant single measure could be the distribution variance. Although these
two measures do not exhaust the possibilities, they are the two in most
common use.


However, a consideration of the mean could lead us astray (as we have seen 
in Exercise 4); it is therefore safer to regard any particular single standard 
measure as indicating the correct conclusion rather than settling it beyond 
all doubt. It is always dangerous in statistics to apply standard techniques 
blindly. 




Exercise 4 
(3 minutes) 


21.2 SAMPLING AND ITS CONSEQUENCES


21.2.1 Random Sampling 


If we throw a die which is known to be fair, then we have a probability 
model of the situation, and we can make predictions about outcomes. 
On the other hand, we may be faced with the very question: “Is the die 
fair?” A knowledge of the sort of results which would result from throwing 
both fair and loaded dice will clearly be a help in answering this question.
Just how it will help is a matter for statistical theory.


To investigate this question, we would have to experiment with the die,
i.e. throw it a number of times. The possible outcomes of any single trial
are the numbers 1, 2, 3, 4, 5, 6. Let

$x_1$ be the outcome of the first trial;
$x_2$ be the outcome of the second trial;
...
$x_n$ be the outcome of the nth trial.

$x_1$ is the numerical value taken by the random variable $X$ in the first trial;
it is the realization of $X$ on this occasion. Similarly for $x_2$, and so on.
The sequence of values $x_1, x_2, \ldots, x_n$ is then called a random sample
from the probability distribution of $X$. It is important to observe two
properties of random samples:

(i) each sample value $x_r$ is a realization of the same random variable $X$;
(ii) the realizations for the various trials are physically independent;
therefore they are statistically independent (see Unit 18).

Thus

$$P(X = x_1 \text{ in first trial}; X = x_2 \text{ in second})$$
$$= P(X = x_1 \text{ in first trial}) \times P(X = x_2 \text{ in second})$$
$$= P(X = x_1) \times P(X = x_2),$$

since $X$ has the same probability distribution in all trials. For the whole
sample, the probability of getting the sequence $x_1, x_2, \ldots, x_n$ is

$$P(X = x_1) \times P(X = x_2) \times \cdots \times P(X = x_n)$$

Remarks (i) and (ii) above are applicable to any random variable defined
on any finite sample space.

The concept of a random sample has so far been expressed in terms of a 
sequence of identical and independent trials; for example, a die thrown 
n times. The same model would apply equally to a set of simultaneous 
identical and independent trials; for example, n distinguishable dice all 
thrown at once. It is often convenient to regard a random sample
$x_1, x_2, \ldots, x_n$ as if each observation $x_r$ were a realization of a distinct
random variable $X_r$. We then get $n$ random variables all with the same
probability distribution. These two ways of regarding a random sample
are equivalent. The second is usually the more useful mathematically.

Randomness, like independence, has been expressed purely in mathematical
terms. To get observations which may reasonably be supposed to conform 
to it, precautions often need to be taken. These may be obvious to a good 
experimenter; for example, a chemist making a series of chemical deter- 
minations would usually wash his test-tubes so that there was no risk of 
contamination between experiments. On the other hand, such precautions 
may need to be more subtle or may even be impossible. For example, in 
three successive landings of an aircraft it is impossible to believe that the 
chance of failure of a tyre is exactly the same on each occasion (we cannot 
disregard tyre wear or even pilot fatigue). It may, however, be possible, by 
tyre and/or pilot changes, to arrange the experiment so that the three 
observations obtained may reasonably be supposed to be a random 
sample from the population which is of interest. 




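Solution 21.1.4.4
For $X$ and $Y$ alone the probability distributions are given by:

r        |   1   |   2    |   3    |   4   |   5   |   6
P(X = r) |  1/4  |  1/4   |  1/4   | 1/12  | 1/12  | 1/12
P(Y = r) | 1/150 | 73/150 | 73/150 | 1/150 | 1/150 | 1/150

$$E(X) = \tfrac{33}{12} = 2.75 \quad \text{and} \quad E(Y) = \tfrac{381}{150} = 2.54, \quad \text{so } E(X) > E(Y).$$

The distribution of $X$ and $Y$ is obtained as in the text:

        | Y = 1  | Y = 2   | Y = 3   | Y = 4  | Y = 5  | Y = 6
X = 1   | 1/600  | 73/600  | 73/600  | 1/600  | 1/600  | 1/600
X = 2   | 1/600  | 73/600  | 73/600  | 1/600  | 1/600  | 1/600
X = 3   | 1/600  | 73/600  | 73/600  | 1/600  | 1/600  | 1/600
X = 4   | 1/1800 | 73/1800 | 73/1800 | 1/1800 | 1/1800 | 1/1800
X = 5   | 1/1800 | 73/1800 | 73/1800 | 1/1800 | 1/1800 | 1/1800
X = 6   | 1/1800 | 73/1800 | 73/1800 | 1/1800 | 1/1800 | 1/1800

From the table, we find that

$$P(X > Y) = \tfrac{669}{1800}$$
$$P(X < Y) = \tfrac{687}{1800}$$

and so we see that $P(X > Y) < P(X < Y)$, although $E(X) > E(Y)$.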




Exercise 1 


(i) A fair die is thrown twice. Write down all the possible random 
samples, together with their probabilities. (Each random sample is, of 
course, a sequence of two elements.) 

(ii) Two indistinguishable fair dice are thrown simultaneously. What 
difference, if any, would this make to the results obtained above? 

(iii) An experimenter throws two fair dice, and records the larger score 
(or the common score if the two scores are equal). He throws them a 
second time, and again records the larger score. Thus he finishes with 
a sequence of two recorded scores. Write down all the possible random 
samples together with their probabilities.


Exercise 1 is so similar to Example 21.1.2.1 on page 7 that you may be
wondering if the definition of a random sample really introduces anything 
new. To help you understand the importance of the concept of a random 
sample, we shall now consider how to construct random samples from 
known probability distributions using random numbers. 


Suppose that we know the sample space of an experiment, but we do not 
know what the sequence of results would be if we were to perform the 
experiment several times. If we want to simulate the experiment, we need a 
method of producing such a sequence in a way which reflects the random- 
ness of the experiment and the probabilities assigned to each of the 
possible outcomes of the experiment. 




Exercise 1 
(3 minutes) 


Suppose that we have a random variable $X$ which can take values
$a_1, a_2, \ldots, a_k$ and that the probability distribution of $X$ is given by

$$P(X = a_r) = \frac{b_r}{c} \quad (r = 1, 2, \ldots, k)$$

where $b_1, b_2, \ldots, b_k$ are positive integers, and $c$ is their sum.

A table of random numbers consists of a sequence of digits obtained in a
way equivalent to throwing successively a ten-sided fair die whose faces are
numbered 0, 1, 2, ..., 9. Suppose that $c \le 10$. Then to obtain a random
sample, $x_1, x_2, \ldots, x_n$, from the known probability distribution, we read
$n$ successive digits from the random numbers. If $n_r$ is the rth such random
number, then

if $0 \le n_r \le b_1 - 1$ we take $x_r = a_1$;
if $b_1 \le n_r \le b_1 + b_2 - 1$ we take $x_r = a_2$;
if $b_1 + b_2 \le n_r \le b_1 + b_2 + b_3 - 1$ we take $x_r = a_3$;

and so on. If any number obtained from the table is greater than or equal
to $c$, we ignore it and continue with the next random number.
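The rule just described can be sketched as a short Python function. The name sample_from_digits is hypothetical, and the digits in the usage lines are made up for illustration rather than read from a printed table of random numbers; the sketch covers only the single-digit case, in which c is at most 10.

```python
def sample_from_digits(values, weights, digits):
    """Turn random digits into a sample from P(X = values[r]) = weights[r] / c.

    values  -- the possible values a_1, ..., a_k
    weights -- the positive integers b_1, ..., b_k, with c = sum(weights) <= 10
    digits  -- random digits, each an integer from 0 to 9
    """
    c = sum(weights)
    if c > 10:
        raise ValueError("single digits only cover the case c <= 10")
    # lookup[n] is the value assigned to the random number n (0 <= n <= c - 1):
    # the first b_1 numbers give a_1, the next b_2 numbers give a_2, and so on.
    lookup = [a for a, b in zip(values, weights) for _ in range(b)]
    # Any digit greater than or equal to c is ignored.
    return [lookup[n] for n in digits if n < c]


if __name__ == "__main__":
    die_faces = [1, 2, 3, 4, 5, 6]
    made_up_digits = [4, 6, 2, 0, 5, 9]     # illustrative digits only
    print(sample_from_digits(die_faces, [1] * 6, made_up_digits))
```

With these made-up digits the digits 6 and 9 are ignored, and each remaining digit n gives the score n + 1.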
Example 1 
The following is a table of 100 random digits: 
4 B210..5' 7 6 7-9. 9 


ea bw 6 As) 3) oD 
a rd oS 3) 5 7 DP B16 
Ph SS ee SS Or Be a 
Mr oF OF Ao 8 8: 
2 PR) Bt A OO) 2G: 
fot S24 CO 2 
SGlT So 2 2/9 2 EF 6 
6 £2 2 70) 4 Oweiee 
Bik aS 2 a Oe 

Let $X$ be the random variable corresponding to throwing a fair die, so that

$$a_r = r \quad (r = 1, 2, \ldots, 6)$$
$$P(X = a_r) = \tfrac{1}{6}$$

In the notation we introduced above, we have

$$b_r = 1 \quad (r = 1, 2, \ldots, 6)$$
$$c = 6$$


We now obtain a random sample using the above rules. The first random
number in the table is $n_1 = 4$, and

$$4 = b_1 + b_2 + b_3 + b_4 \le n_1 \le b_1 + b_2 + b_3 + b_4 + b_5 - 1,$$

so that $x_1$, the first element in our sequence, is 5. The second random
number is 6 and, following our instructions, we ignore it. In general, if
$n_r < c$, we find that $x_r = n_r + 1$, and we get the random sample

5, 3, 1, 6, 2, 4, ...

If $10 < c \le 100$, we have to obtain random two-digit numbers from the
table; we do this by reading the random digits in pairs. Apart from this
modification, the above instructions hold.




Solution 1 


(i) This solution has in effect already been given at the beginning of
section 21.1.2 (page 7).

(ii) The only difference here is that there is now no distinction between the
dice; that is, a random sample is a sequence having only one element,
that element being a pair of numbers (not an ordered pair); we therefore
write the element which consists of the numbers a and b as the set
{a, b}.*

{1,1}  {1,2}  {1,3}  {1,4}  {1,5}  {1,6}
1/36   1/18   1/18   1/18   1/18   1/18
       {2,2}  {2,3}  {2,4}  {2,5}  {2,6}
       1/36   1/18   1/18   1/18   1/18
              {3,3}  {3,4}  {3,5}  {3,6}
              1/36   1/18   1/18   1/18
                     {4,4}  {4,5}  {4,6}
                     1/36   1/18   1/18
                            {5,5}  {5,6}
                            1/36   1/18
                                   {6,6}
                                   1/36

(iii) In order to see the general pattern, it is helpful to consider a few special
cases:

first throw     | second throw    | random sequence
die A | die B   | die A | die B   |
  1   |   1     |   1   |   1     | 1, 1
  1   |   1     |   1   |   2     | 1, 2
  1   |   1     |   2   |   1     | 1, 2
  1   |   1     |   2   |   2     | 1, 2

The random sequence 1, 1 can only be obtained in 1 way; the random
sequence 1, 2 can be obtained in 3 ways; and so on. Since each experiment
involves 4 throws, the sequence 1, 1 has probability 1/1296; the sequence 1, 2
has probability 3/1296; and so on.

The samples with their probabilities are:

1,1       1,2       1,3       1,4       1,5       1,6
1/1296    3/1296    5/1296    7/1296    9/1296    11/1296
2,1       2,2       2,3       2,4       2,5       2,6
3/1296    9/1296    15/1296   21/1296   27/1296   33/1296
3,1       3,2       3,3       3,4       3,5       3,6
5/1296    15/1296   25/1296   35/1296   45/1296   55/1296
4,1       4,2       4,3       4,4       4,5       4,6
7/1296    21/1296   35/1296   49/1296   63/1296   77/1296
5,1       5,2       5,3       5,4       5,5       5,6
9/1296    27/1296   45/1296   63/1296   81/1296   99/1296
6,1       6,2       6,3       6,4       6,5       6,6
11/1296   33/1296   55/1296   77/1296   99/1296   121/1296


* In Unit 1, Functions we noted that, when listing the elements of a set, we do not record
repetitions. “Repetitions” in that context really means “repetitions”; in this
context, we write the element consisting of the pair of numbers a and a as {a, a}.




The possibility of short cuts should not be overlooked. If we use the method 
described above to construct random samples from the distribution given 
by 


we have c = 5, and half the numbers in the random table are therefore 
wasted. It is better in this case to re-express the probabilities as 


P(X = 0) = % 

P(X = 1) = 5 
We now have c = 10; the process is less wasteful than before and easier to 
apply. 
A more sophisticated version of this technique, suitable for use on 
computers, is used under the name simulation to examine the consequences 
of probability models in situations where the calculations are very 
complicated ; it also has a minor place in the history of statistical theory 


where some of the standard results, now established mathematically, were 
first obtained by this procedure. 


The next exercise is intended to show that a small sample may not be 
adequate to distinguish between two probability models. 


Exercise 2 


Use the random numbers given in Example 1 to obtain:
(i) a random sample consisting of 20 scores obtained from a fair die (see
Example 1);
(ii) a random sample consisting of 20 scores obtained from a loaded die,
whose probability distribution is given by

$$P(X = 1) = P(X = 6) = \tfrac{1}{5}$$
$$P(X = 2) = P(X = 3) = P(X = 4) = P(X = 5) = \tfrac{3}{20}$$

In this second case, begin reading the random numbers from the beginning
of the next full row from where you left off in (i), and take c = 100.




Exercise 2 
(2 minutes) 


Solution 2
(i) The random sample is
5, 3, 1, 6, 2, 4, 2, 6, 5, 5, 6, 2, 4, 4, 3, 4, 4, 6, 4, 6.

This has used the first 31 digits.
(ii) (Notice that with c = 20, we wouldn't have enough digits left in the
table.)

c = 100, so we use two digits at a time from the table.
We have

$$b_1 = 20, \quad b_2 = b_3 = b_4 = b_5 = 15, \quad b_6 = 20$$
and
$$a_r = r \quad (r = 1, 2, \ldots, 6)$$

Since $b_1 = 20$, for any number $n_r$ from the table less than or equal to 19,
we have

$$0 \le n_r \le b_1 - 1 = 19, \text{ and so } x_r = a_1 = 1.$$

Proceeding in a similar way, we can obtain the table:

n_r              | x_r
0 ≤ n_r ≤ 19     | 1
20 ≤ n_r ≤ 34    | 2
35 ≤ n_r ≤ 49    | 3
50 ≤ n_r ≤ 64    | 4
65 ≤ n_r ≤ 79    | 5
80 ≤ n_r ≤ 99    | 6

We now use the digits in pairs beginning at the fifth row (i.e. we use
67, 70, 74, ...) and the random sample is

5, 5, 5, 4, 5, 2, 6, 3, 1, 2, 1, 1, 1, 3, 1, 3, 1, 2, 6, 6.

The following table shows the frequency with which each score occurs in
the random sample.

Score           | 1 | 2 | 3 | 4 | 5 | 6
(i) Fair die    | 1 | 3 | 2 | 6 | 3 | 5
(ii) Loaded die | 6 | 3 | 3 | 1 | 4 | 3

If anything, the results for the loaded die look more evenly distributed
than those for the fair die. These numbers were not specially chosen to
exhibit this property.


21.2.2 Sampling Statistics and Their Distributions

Introduction

Given a random sample, $x_1, x_2, \ldots, x_n$, from some sample space, it will
have a mean $\bar{x}$. If we repeated the whole experiment, we would have
another random sample $x'_1, x'_2, \ldots, x'_n$, with mean $\bar{x}'$. We could repeat the
whole operation many times, getting further means $\bar{x}''$, $\bar{x}'''$ and so forth.
Now we obtained the sample $x_1, x_2, \ldots, x_n$ by carrying out a trial $n$ times,
the rth trial providing the value $x_r$. If we think of these $n$ trials as constituting
one single “super-trial”, $\bar{x}$ is the outcome of this super-trial. Other super-trials
give us the outcomes $\bar{x}', \bar{x}'', \ldots$, but this situation is no different in
kind from ordinary trials: therefore $\bar{x}, \bar{x}', \bar{x}'', \ldots$ are realizations of a
random variable which may be denoted by $\bar{X}$. $\bar{X}$ is a special case of a
sampling statistic, and the probability distribution of $\bar{X}$ is referred to as a
sampling distribution. This nomenclature makes reference to the sampling
process which gives rise to $\bar{X}$, but it also has the advantage of distinguishing
$\bar{X}$ from the original random variable $X$, and the sampling distribution of
$\bar{X}$ from the original (often called parent) distribution of $X$. In other words,
the “children” of parent distributions are sampling distributions.


Sampling Statistics 
In the notation introduced in the previous paragraph, we can define a
function which maps a random sample, $x_1, x_2, \ldots, x_n$, to its mean $\bar{x}$.

More generally, if there is a function

$$h : (x_1, x_2, \ldots, x_n) \longmapsto h(x_1, x_2, \ldots, x_n) = y$$

from the set of all random samples to R, then in the same way as for the
particular case $\bar{x}$, y is a realization of a random variable Y which is
called a sampling statistic. The probability distribution of Y is called a
sampling distribution. The subject of mathematical statistics is mainly
concerned with the determination of these sampling distributions.


Example 1

Returning to the special case of the mean, suppose we throw a fair die
three times, and that we are interested in the arithmetic mean of the values
$x_1, x_2, x_3$ thus obtained. In other words, we are interested in

$$\bar{x} = \frac{x_1 + x_2 + x_3}{3}$$

and in the sampling distribution of $\bar{X}$. There are 6 possible values for each
of $x_1, x_2, x_3$, and therefore there are $6^3 = 216$ possible samples. Only the
sample 1, 1, 1 has 1 as its mean. The samples 2, 1, 1; 1, 2, 1; 1, 1, 2 all have
means of $\tfrac{4}{3}$. The mapping from samples to their means is shown below:


[Figure: the mapping from random samples to their means]


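By evaluating the means of all the 216 sample points and remembering
that each occurs with a probability of $\tfrac{1}{216}$, we find that the sampling
statistic $\bar{X}$ has the probability distribution:

$\bar{x}$                 1      4/3    5/3    2      7/3    8/3    3      10/3
$P(\bar{X} = \bar{x})$    1/216  3/216  6/216  10/216 15/216 21/216 25/216 27/216

$\bar{x}$                 11/3   4      13/3   14/3   5      16/3   17/3   6
$P(\bar{X} = \bar{x})$    27/216 25/216 21/216 15/216 10/216 6/216  3/216  1/216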



Exercise 1 


A fair die is thrown k times, and $x_{\max}$ is the greatest of the k scores thus
obtained. If $X_{\max}$ is the random variable having $x_{\max}$ as its realization,
show that $X_{\max}$ has distribution given by

$$P(X_{\max} = r) = \left(\frac{r}{6}\right)^k - \left(\frac{r - 1}{6}\right)^k \quad (r = 1, 2, \ldots, 6).$$

(HINT: Work out the answer first for $x_{\max} = 1$, then for $x_{\max} = 2$, etc.)


Sampling statistics are simply random variables, but they are based on 
random samples rather than on outcomes of individual trials; other than 
this, they do not raise any essentially new ideas. If, as in the simple cases 
which we have considered, the complete sampling distribution can be 
determined, then we have complete information for making probability 
statements about the sampling statistic. In complicated cases, we cannot 
obtain the sampling distribution, and we have to be content with knowing 
only its mean and variance; these measures are sometimes referred to as 
the sample mean and the sample variance of the statistic. 


The measures for the parent and sampling distribution are not unconnected.
Let us look at the mean and variance of the sampling distribution of the
average score for three throws of a fair die (Example 1). Using that
distribution, we obtain

$$E(\bar{X}) = \sum \bar{x}\, P(\bar{X} = \bar{x}) = \frac{2268}{648} = 3.5$$

Thus the mean of that sampling distribution is the same (3.5) as the mean
which we obtained on page 14 for the parent distribution.


At this stage, it is as well to take stock so as to avoid verbal confusion. 
Given any random variable X, it has a mean $\mu = E(X)$. Given a random
sample, the sampling statistic $\bar{X}$ (whose realizations are sample means $\bar{x}$)
has a probability distribution whose mean is $E(\bar{X})$. Thus in a sense we have
the “mean of the mean”. There is no reason why any particular realization
$\bar{x}$ should equal $\mu$; and we have seen no reason so far for connecting $E(\bar{X})$
with $\mu$. For the die, however, we have demonstrated for samples of three
that $E(\bar{X}) = 3.5 = \mu$.


Similarly, we can calculate the mean of the sample variance, or the variance 
of the sample mean. 


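Example 2
For three throws of a fair die, we shall find $\mathrm{var}(\bar{X})$.

$$\mathrm{var}(\bar{X}) = E(\bar{X}^2) - \mu^2 \quad \text{(see Exercise 21.1.4.3)}$$

$$= \sum \bar{x}^2\, P(\bar{X} = \bar{x}) - \frac{49}{4}$$

$$= \frac{9}{9} \cdot \frac{1}{216} + \frac{16}{9} \cdot \frac{3}{216} + \frac{25}{9} \cdot \frac{6}{216} + \cdots + \frac{324}{9} \cdot \frac{1}{216} - \frac{49}{4}$$

$$= \frac{25704}{1944} - \frac{49}{4}$$

$$= \frac{35}{36}$$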




This result may be compared with the result for the original probability 
distribution of X, for the score of a fair die, where we obtained $\mathrm{var}(X) =
\tfrac{35}{12}$. (See Exercise 21.1.4.3(ii).) Thus, by taking the average of a sample of
three observations, the variance has been reduced by a factor of 3. This is 
not just a coincidence; it is a special case of a general rule about the 
variance of the sampling distribution of the mean. We shall see the 
practical importance of this in section 21.2.3, where we look at estimation. 


The computations of the mean and variance of X in this simple case 
suggest certain general properties. These general properties will not be 
proved in this unit, but we state them here. 

If $E(X) = \mu$ and $\mathrm{var}(X) = \sigma^2$,

then for a random sample of size n,

$$E(\bar{X}) = \mu \quad \text{and} \quad \mathrm{var}(\bar{X}) = \frac{\sigma^2}{n}.$$

Since variance is used as a measure of dispersion, these results imply that
the realizations of $\bar{X}$ are more closely clustered (and hence closer to
$\mu$) the larger n becomes.
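Although the general results are not proved in this unit, they can be checked exactly for small samples from a fair die. The following Python sketch does this by brute-force enumeration for n = 1, 2, 3, 4. It is a check of particular cases under our own naming (`sample_mean_moments`), not a proof.

```python
from fractions import Fraction
from itertools import product


def sample_mean_moments(n, faces=6):
    """Exact mean and variance of the mean of n throws of a fair die."""
    prob = Fraction(1, faces ** n)          # every sample is equally likely
    mean = Fraction(0)
    mean_of_square = Fraction(0)
    for sample in product(range(1, faces + 1), repeat=n):
        x_bar = Fraction(sum(sample), n)
        mean += x_bar * prob
        mean_of_square += x_bar ** 2 * prob
    return mean, mean_of_square - mean ** 2


# n = 1 gives the parent distribution: mu and sigma squared.
MU, SIGMA_SQ = sample_mean_moments(1)

results = {n: sample_mean_moments(n) for n in (1, 2, 3, 4)}
for n, (mean, variance) in results.items():
    print(n, mean, variance, SIGMA_SQ / n)
```

For each n the printed variance agrees with the last column, which is $\sigma^2/n$ computed from the parent distribution.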


Some of the material in this and earlier sections is summarized below. 

(i) A random variable X has realizations $a_1, a_2, \ldots, a_k$ which are possible
numerical outcomes of a single trial.

(ii) A sampling statistic is a random variable associated with a random
sample; it is some function h defined on the set of random samples.

                                              Equal to          Denoted by
(iii)  expectation of X = mean
       of distribution of X                                     $\mu$ or $E(X)$
(iv)   mean of $x_1, x_2, \ldots, x_n$                          $\bar{x}$
(v)    mean of random variables
       $X_1, X_2, \ldots, X_n$                                  $\bar{X}$
(vi)   variance of $x_1, x_2, \ldots, x_n$
(vii)  variance of X                                            $\sigma^2$ or $\mathrm{var}(X)$
(viii) mean of sample means                   $\mu$ *           $E(\bar{X})$
(ix)   variance of sample means               $\sigma^2/n$ *    $\mathrm{var}(\bar{X})$

* These two results have been quoted without proof.


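Solution 1

$$P(X_{\max} = 1) = P(\text{all } k \text{ scores are } 1) = \left(\frac{1}{6}\right)^k = \left(\frac{1}{6}\right)^k - \left(\frac{0}{6}\right)^k$$

$$P(X_{\max} = 1) + P(X_{\max} = 2) = P(X_{\max} \le 2)$$

$$P(X_{\max} = 2) = P(X_{\max} \le 2) - P(X_{\max} = 1) = P(\text{all } k \text{ scores are} \le 2) - \left(\frac{1}{6}\right)^k = \left(\frac{2}{6}\right)^k - \left(\frac{1}{6}\right)^k$$

Similarly

$$P(X_{\max} = 3) = P(X_{\max} \le 3) - P(X_{\max} \le 2) = \left(\frac{3}{6}\right)^k - \left(\frac{2}{6}\right)^k$$

In general,

$$P(X_{\max} = r) = P(X_{\max} \le r) - P(X_{\max} \le r - 1) = \left(\frac{r}{6}\right)^k - \left(\frac{r - 1}{6}\right)^k.$$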
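The general formula can be compared with a direct count over all $6^k$ samples. The following Python sketch does so for small k. The function names are our own, and the comparison covers only the values of k tried, so it illustrates the argument rather than replacing it.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def max_distribution_by_enumeration(k, faces=6):
    """Distribution of the greatest of k scores, found by listing every sample."""
    counts = Counter(max(sample)
                     for sample in product(range(1, faces + 1), repeat=k))
    return {r: Fraction(counts[r], faces ** k) for r in range(1, faces + 1)}


def max_distribution_by_formula(k, faces=6):
    """Distribution given by P(X_max = r) = (r/6)^k - ((r - 1)/6)^k."""
    return {r: Fraction(r, faces) ** k - Fraction(r - 1, faces) ** k
            for r in range(1, faces + 1)}


agreement = {k: max_distribution_by_enumeration(k) == max_distribution_by_formula(k)
             for k in (1, 2, 3, 4)}
```

Each entry of `agreement` records whether the two methods give the same distribution for that k.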


21.2.3 Parametric Models and Statistics as Estimates


So far we have assumed that we knew the probabilities of the various 
outcomes in the sample space corresponding to a single trial. We have, 
for example, assumed that a die being thrown is fair, or if it is loaded, that 
the form of the bias is known; that the probability of failure of an aircraft 
tyre is known; and so on. In real life the nature of the bias in a die or the 
probability of failure of a tyre will be unknown. It is precisely these 
unknowns about which we hope to obtain information from our observa- 
tions. To do this we need first to construct models in which a statement 
such as “the die is loaded” is made more specific. This is usually done by 
the introduction of one or more parameters. 


If we know nothing about a die, all we can say about the random variable 
corresponding to the score is that it satisfies 


$$P(X = r) = p_r \quad (r = 1, 2, \ldots, 6)$$

where $0 \le p_r \le 1$ and $\sum_{r=1}^{6} p_r = 1$;

$p_1, p_2, \ldots, p_6$ are called parameters of the system. They are constant in any
situation (e.g. for the same die thrown in the same kind of way), though
their values are unknown. The distribution defined above is our probability
model of the situation. Note that models are not necessarily meant to be
absolutely correct in every detail. Given points on a graph, a scientist
does not necessarily find a curve to pass accurately through each and every 
point. If the data appear to warrant it, he fits a straight line which represents 
the essential trend indicated by the points. In our case, we try to use any 
information about the physical system to reduce the number of parameters. 




For instance, suppose that we have a die for which there is some evidence 
that it is biassed, because the 6’s do not occur one in six times in a long 
sequence of trials although all the other scores appear to occur equally 
often. If we take the probability of a 6 as p, and the probabilities of all 
the other numbers to be equal (or even approximately equal), then the 
appropriate model to take is the distribution given by 


$$P(X = 6) = p$$

$$P(X = r) = \frac{1 - p}{5} \quad (r = 1, 2, 3, 4, 5)$$


On the other hand, we may know something of the construction of the die, 
and we may think that the following model is more accurate: 


$$P(X = 6) = p$$
$$P(X = r) = \tfrac{1}{6} \quad (r = 2, 3, 4, 5)$$
$$P(X = 1) = \tfrac{1}{3} - p$$

where $0 \le p \le \tfrac{1}{3}$ but p is otherwise unknown.


Both these models contain a single parameter, because we have taken into 
account our knowledge of the system. (In much the same way, a scientist 
may fit a straight line to a set of points because other reasoning leads him 
to expect a straight line, and the object of his experiment is to find a
particular line rather than any curve to fit the data.) 
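The two single-parameter models for a loaded die can be written as functions of p, which makes it easy to confirm that each defines a proper distribution for any admissible value of the parameter. The following Python sketch is illustrative: the function names `model_first` and `model_second` are our own labels for the two models described above.

```python
from fractions import Fraction


def model_first(p):
    """P(X = 6) = p; the other five scores share the remaining 1 - p equally."""
    if not 0 <= p <= 1:
        raise ValueError("p must lie between 0 and 1")
    dist = {r: (1 - p) / 5 for r in range(1, 6)}
    dist[6] = p
    return dist


def model_second(p):
    """P(X = 6) = p, P(X = 1) = 1/3 - p, P(X = r) = 1/6 for r = 2, 3, 4, 5."""
    if not 0 <= p <= Fraction(1, 3):
        raise ValueError("p must lie between 0 and 1/3")
    dist = {r: Fraction(1, 6) for r in range(2, 6)}
    dist[1] = Fraction(1, 3) - p
    dist[6] = p
    return dist


# With p = 1/6 both models reduce to the fair die.
fair_from_first = model_first(Fraction(1, 6))
fair_from_second = model_second(Fraction(1, 6))
```

Whatever admissible value of p is chosen, the six probabilities in either model add up to 1, so a single number p fixes the whole distribution.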


In general, the problem of choosing an appropriate model requires a great 
deal of background knowledge about the physical conditions of the 
experiment, as well as experience of analogous physical situations. 
The use of standard statistical methods often obscures the need to examine 
the appropriateness of the model. 


The determination of the sampling distribution of a statistic based on 
observations from a parametric model again presents no new problems in 
principle. What is new in the situation is that there is little point in taking 
the statistic unless it provides some information about the value of the 
parameter. From any particular sample it is possible to compute many 
different sampling statistics, and the nature of the information which any 
particular statistic provides about the parameter can only be determined 
from its sampling distribution. The first role sampling distributions play in 
statistical theory is therefore in the comparison of sampling statistics as 
estimates of the parameter (or parameters) in the model. 


Example 1 


As an example to illustrate this, consider a situation which at first sight 
might appear to be modelled by an N-sided fair die for which N is unknown. 
Suppose an enemy produces pieces of equipment, and each piece of equip- 
ment receives a serial number. Assuming that these serial numbers run 
from 1 to N, intelligence officers obtain the serial numbers of n of these
pieces of equipment, and require information about N, the total number 
of pieces of equipment made. Alternatively, suppose a forger has produced 
banknotes numbered from 1 to N. After he has been caught, the Bank of
England wish to know how many forged notes are in circulation. Again, 
the information about N will have to be obtained from the serial numbers 
of n notes which have been detected as forgeries. 


It is reasonable to assume that in each of these situations all existing serial 
numbers are equally likely to be obtained. (This assumption may be false,
but here again we face the eternal problem for the statistician. What other
assumption could he make?) Note that we assume that no two items have
the same serial number; this is not exactly the situation described by a die,
since for the die the same score can come up twice. If, as a model, we regard
all the serial numbers as being written on cards which are then shuffled,
and the cards drawn one at a time, then the situation we are dealing with is
random sampling without replacement, whereas throwing a die corresponds
to random sampling with replacement. (Random selection with and without
replacement was discussed in Unit 18.) In each of these situations, however,
N is a parameter; it is an unknown, whose value needs to be specified if 
exact probability predictions of the result of an experiment are to be made. 
In the statistical context it is something which we wish to estimate from the 
data we obtain. The way a statistician does this is to take the realization 
of some sampling statistic whose distribution is in some sense grouped 
closely round the true value of N. A sampling statistic with this kind of 
property is sometimes said to be an estimator of the parameter. 


To illustrate the sort of considerations which would enter into our choice 
of such a statistic, suppose we obtain two serial numbers, $x_1$ and $x_2$. This
pair of numbers is supposed to be a random sample. As the first obvious
attempt to estimate N, consider the average $\bar{x} = \frac{x_1 + x_2}{2}$; this is a
realization of the sampling statistic $\bar{X}$. The possible values this statistic can
take are $\frac{3}{2}, \frac{4}{2}, \frac{5}{2}, \ldots, \frac{2N - 1}{2}$. The sampling distribution can be
obtained by relatively simple arguments, but the work is tedious. (For example,
to find $E(\bar{X})$ we would have to find all the possible values of $\bar{x}$ and then the
associated probabilities. The probabilities can be calculated from the fact
that there are $N(N - 1)$ possible pairs, each with probability $\frac{1}{N(N - 1)}$.)
We shall simply quote the results we want, namely:

$$E(\bar{X}) = \frac{N + 1}{2}$$

$$E(\bar{X}^2) = \frac{7N^2 + 11N + 4}{24}$$

so that

$$\mathrm{var}(\bar{X}) = \frac{7N^2 + 11N + 4}{24} - \frac{(N + 1)^2}{4} = \frac{N^2 - N - 2}{24}$$

Now the serial numbers have a distribution whose random variable is X,
and the mean of X is $\frac{N + 1}{2}$. Further, we have the sampling distribution
of $\bar{X}$, and this also has mean $\frac{N + 1}{2}$. Then we can either think of our
numerical $\bar{x}$ as being the mean of a sample* of 2 from the original distribution,
or as being a sample of 1 from the sampling distribution of $\bar{X}$;
the two amount to exactly the same in the end, for in either case we are
mentally “equating” $\bar{x}$ with $\frac{N + 1}{2}$. But if we are prepared to take $\bar{x}$ as our
estimate of $\frac{N + 1}{2}$, we must be prepared to take $2\bar{x}$ as our estimate of
$N + 1$, and hence $2\bar{x} - 1$ as our estimate of N. The random variable having
$2\bar{x} - 1$ as its realization is $2\bar{X} - 1$; denoting this by $\hat{N}$, we are using the
sampling statistic $\hat{N} = 2\bar{X} - 1$ to estimate N.


At this stage, let us take stock of the situation. Out of the $N(N - 1)$
possible serial number pairs we have obtained one, and we have taken the
mean, $\bar{x}$, of the two numbers constituting the pair. If we repeated the whole
thing from scratch (remember that we are sampling without replacement),
we would get a mean $\bar{x}_2$. On the nth occasion we would get $\bar{x}_n$. These n
values, $\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_n$, are realizations of the sampling statistic $\bar{X}$. We
have taken $\hat{N} = 2\bar{X} - 1$, and we know that its expected value is equal to N.
Although no realization of $\hat{N}$ is necessarily equal to N, it is in a sense
centred on N as N is the expectation of $\hat{N}$.

* Remember that the sampling here is sampling without replacement.


We observe further that X is distributed symmetrically about its mean
value of $\frac{N + 1}{2}$, so that $\bar{X}$ is distributed symmetrically about its mean
value of $\frac{N + 1}{2}$. So $\hat{N} = 2\bar{X} - 1$ is distributed symmetrically about its
mean value of N. Thus $\hat{N}$ is just as likely to be too small by a certain
amount as it is to be too big by that amount. This would appear to be a
desirable property of an estimator.


We have argued that

$\hat{N}$ has mean value N;

$\hat{N}$ is distributed symmetrically about N.

We also know that

$$\mathrm{var}(\hat{N}) = \mathrm{var}(2\bar{X} - 1) = 4\,\mathrm{var}(\bar{X}) \quad \text{(see Exercise 21.1.4.3)}$$

$$= \frac{N^2 - N - 2}{6} \quad \text{(see page 32)}$$

which gives us a measure of the spread of $\hat{N}$.


Another statistic whose sampling distribution may easily be derived is
$X^* = \max(X_1, X_2)$. From the array of possible samples we may obtain
the sampling distribution of $X^*$, and hence work out the mean and
variance of $X^*$; these are given by

$$E(X^*) = \frac{2(N + 1)}{3}$$

$$\mathrm{var}(X^*) = \frac{N^2 - N - 2}{18}$$

For the same kinds of reason as before, we would be prepared to accept $x^*$
(a realization of $X^*$) as an estimate of $E(X^*) = \frac{2(N + 1)}{3}$. It follows that we
would accept $\tfrac{3}{2}x^* - 1$ as an estimate of N. This estimate is a realization of
the random variable $N^* = \tfrac{3}{2}X^* - 1$, which has mean

$$E(N^*) = N$$

and variance

$$\mathrm{var}(N^*) = \mathrm{var}\left(\tfrac{3}{2}X^* - 1\right) = \tfrac{9}{4}\,\mathrm{var}(X^*) = \frac{N^2 - N - 2}{8}$$

So now we have two different estimates of N, both of them shown to be
reasonable. Is there anything to choose between them? Both have mean
N, so there is nothing to choose between them on that score. However,

$$\mathrm{var}(\hat{N}) = \frac{N^2 - N - 2}{6}$$

and

$$\mathrm{var}(N^*) = \frac{N^2 - N - 2}{8}$$


Now variance is a measure of dispersion. Hence the smaller the variance, 
the more compact the distribution, and the nearer the values within the 
distribution will be to the mean. We know that $\mathrm{var}(N^*) = \tfrac{3}{4}\,\mathrm{var}(\hat{N})$, and
therefore $N^*$ will tend to give a closer estimate than $\hat{N}$; therefore $N^*$ is
a better estimator than $\hat{N}$.
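The quoted means and variances can be checked for any particular N by listing all $N(N - 1)$ ordered pairs of distinct serial numbers. The following Python sketch does this with exact fractions. The function name `estimator_moments` and the choice N = 10 are our own, made only for illustration.

```python
from fractions import Fraction
from itertools import permutations


def estimator_moments(N):
    """Exact (mean, variance) of N-hat = 2*X-bar - 1 and N* = (3/2)*max - 1
    for a sample of two serial numbers drawn without replacement from 1..N."""
    pairs = list(permutations(range(1, N + 1), 2))   # N(N - 1) ordered pairs
    prob = Fraction(1, len(pairs))                   # all pairs equally likely

    def moments(values):
        mean = sum(values) * prob
        variance = sum(v * v for v in values) * prob - mean ** 2
        return mean, variance

    n_hat = [Fraction(a + b) - 1 for a, b in pairs]             # 2 * (a + b)/2 - 1
    n_star = [Fraction(3 * max(a, b), 2) - 1 for a, b in pairs]
    return moments(n_hat), moments(n_star)


(hat_mean, hat_var), (star_mean, star_var) = estimator_moments(10)
print(hat_mean, hat_var)     # compare with N and (N^2 - N - 2)/6
print(star_mean, star_var)   # compare with N and (N^2 - N - 2)/8
```

For N = 10 both estimators have mean 10, and the ratio `star_var / hat_var` is 3/4, in line with the comparison of variances above.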


Two other points of comparison between $\hat{N}$ and $N^*$ are worth noticing.

(i) $\hat{N}$ takes only integral values; $N^*$ can take fractional values. This does
not, of course, rule out $N^*$ as an estimator.

(ii) The distribution of $\hat{N}$ is symmetrical, but the distribution of $N^*$ is
not.


Summarizing the general points about statistical procedures which this 
example has illustrated, we have seen that a comparison of the usefulness 
of sampling statistics in providing information about a parameter reduces 
to a comparison of their sampling distributions. Deliberately, no rules 
have been provided for this comparison, since what is important in one 
comparison may be unimportant in another. For example, if we required 
the sampling distribution of the estimate to be symmetrical about N,
then of the two statistics $\hat{N}$ and $N^*$ we would clearly have to choose $\hat{N}$ as
the estimator, even though it is not as close to N as $N^*$.


These remarks should give an idea of the kinds of considerations confront- 
ing a statistician. They are not intended to afford a complete study of 
estimation and of “good” properties which estimates (or rather their 
sampling distributions) should satisfy. In any event, in some situations we
often have to be satisfied with less than the best, because the theory for the 
best is too intractable. 




21.3 THE BINOMIAL DISTRIBUTION
21.3.0 Introduction 


There are many situations where an event has a certain probability of 
occurring, and where we are interested in the number of successes in a 
sequence of trials. Thus assuming that the probability of a newborn baby 
being a boy is $\tfrac{1}{2}$, we could inquire about the probability of any specified
number of boys in a family of n. In less down-to-earth vein, we could be 
interested in the probability of obtaining any specified number of 6's in n 
throws of a biassed die. Mathematically these two situations are equiva- 
lent: in each case we have 

$P(\text{“success”}) = p$ and $P(\text{“failure”}) = 1 - p$ in a single trial,


and we are interested in the probability of, say, r successes in a sequence of 
n independent trials. 


If X is a random variable corresponding to the number of successes in a 
single trial, then the probability distribution of X is given by 


$$P(X = 1) = p$$
$$P(X = 0) = 1 - p$$

Suppose we take a sample $x_1, \ldots, x_n$ from this distribution. Since each of
the x’s is either 0 or 1, the total number of successes in this sample is

$$t_n = x_1 + x_2 + \cdots + x_n$$

Now $t_n$ is a realization of a random variable

$$T_n = X_1 + X_2 + \cdots + X_n$$

So we are interested in the distribution of $T_n$, which is a sampling statistic.
In other words, we are interested once again in a sampling distribution. 


21.3.1 Deriving the Distribution 
Let us begin modestly with n = 2. 


Denoting a success by S, and a success followed by a failure by SF, etc., we 
have 


$$P(S) = p$$
$$P(F) = 1 - p$$
$$P(SS) = p^2$$
$$P(SF) = p(1 - p)$$
$$P(FS) = (1 - p)p$$
$$P(FF) = (1 - p)^2$$

Therefore,

$$P(T_2 = 0) = P(FF) = (1 - p)^2$$
$$P(T_2 = 1) = P(SF) + P(FS) = 2p(1 - p)$$
$$P(T_2 = 2) = P(SS) = p^2$$




Similarly, for three trials we have 
$P(FFF) = (1 - p)^3$        $P(FSS) = p^2(1 - p)$
$P(FFS) = p(1 - p)^2$       $P(SFS) = p^2(1 - p)$
$P(FSF) = p(1 - p)^2$       $P(SSF) = p^2(1 - p)$
$P(SFF) = p(1 - p)^2$       $P(SSS) = p^3$

so that

$$P(T_3 = 0) = (1 - p)^3$$
$$P(T_3 = 1) = 3p(1 - p)^2$$
$$P(T_3 = 2) = 3p^2(1 - p)$$
$$P(T_3 = 3) = p^3$$


It should now be apparent (although it has not been proved) that if a
particular sequence contains r successes and $n - r$ failures, then its
probability is $p^r(1 - p)^{n - r}$. How many sequences are there containing r
successes and $n - r$ failures? Let us look at it this way. We may arrive at
such a sequence by writing down a string of n circles

[Figure: a row of n empty circles]

and then writing S (corresponding to “success”) inside r of the circles and
F (corresponding to “failure”) inside the remaining $n - r$ circles. For
example,

[Figure: the row of circles with S written in r of them and F in the rest]

The number of such sequences is the same as the number of ways of
selecting r items from n; and this is

$$\binom{n}{r} = \frac{n!}{r!\,(n - r)!}$$

Each such sequence has probability $p^r(1 - p)^{n - r}$ of occurring. All such
sequences are different, that is, exclusive; therefore the probability of
getting r successes is equal to the probability of getting one or other of the
$\binom{n}{r}$ such sequences, which is

$$\binom{n}{r} p^r (1 - p)^{n - r}.$$

Thus we arrive at the probability distribution given by

$$P(T_n = r) = \binom{n}{r} p^r (1 - p)^{n - r} \quad (r = 0, 1, 2, \ldots, n).$$
r 
As we have a distribution, we would expect all the probabilities on the
right-hand side to sum to unity. This they do; for representing $1 - p$ by q,
we have

$$\sum_{r=0}^{n} \binom{n}{r} p^r (1 - p)^{n - r} = \sum_{r=0}^{n} \binom{n}{r} p^r q^{n - r}$$

$$= (p + q)^n \quad \text{(by the binomial theorem)}$$

$$= 1 \quad (p + q = 1)$$


Because of the form this distribution takes, we call it the binomial 
distribution, and we say that $T_n$ is a binomial random variable.
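The counting argument can be checked numerically by adding up the probabilities of all $2^n$ sequences of successes and failures and comparing the totals with the binomial formula. The following Python sketch does this with exact fractions. The function names are our own, and the values n = 6, p = 1/3 are chosen only as an example.

```python
from fractions import Fraction
from itertools import product
from math import comb


def binomial_pmf(n, p):
    """P(T_n = r) for r = 0, 1, ..., n from the binomial formula."""
    return [comb(n, r) * p ** r * (1 - p) ** (n - r) for r in range(n + 1)]


def pmf_by_enumeration(n, p):
    """The same probabilities, found by summing over every sequence of S and F."""
    probs = [Fraction(0)] * (n + 1)
    for sequence in product((0, 1), repeat=n):   # 1 = success, 0 = failure
        r = sum(sequence)
        probs[r] += p ** r * (1 - p) ** (n - r)
    return probs


p = Fraction(1, 3)
pmf = binomial_pmf(6, p)
same = pmf == pmf_by_enumeration(6, p)
total = sum(pmf)
```

Here `same` records that the two calculations agree, and `total` is the sum of the probabilities, which the binomial theorem says must be 1.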


(See RB9)

The binomial distribution occurs frequently. Mention has been made
earlier of failure in aircraft tyres. In that context we would probably
(rather perversely) regard a tyreburst as a “success” and lack of a burst as
“failure”. The two-party voting situation may also be looked upon in this
way. Those who vote Conservative may be looked on as “successes”
(or, of course, “failures”), although it is doubtful whether votes in an actual
election may be regarded as a random sample of the potential votes of the
whole electorate. On the other hand, a public opinion poll, which selects
a few people to find their political views, may obtain a random sample of
those views in the electorate. Another example occurs in medical work,
where patients treated with a drug will either recover or not. Notice that
although the basic model has been described in terms of a sequence of
trials, the trials may, in fact, be simultaneous. For example, a group of
hospital patients might all be treated with a drug at the same time.


Exercise 1

(i) Taking $p = \tfrac{1}{3}$, $n = 6$, work out the complete binomial distribution
(i.e. find $P(T_6 = r)$ for $r = 0, 1, \ldots, 6$).
(ii) Find the mean and variance of the distribution

$$P(X = 1) = p$$
$$P(X = 0) = 1 - p$$

(iii) Use the result of (ii) to show that the binomial random variable $T_n$ has
mean $np$ and variance $np(1 - p)$.
(HINT: Use results (viii) and (ix) in the table on page 29.)


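Solution 1

(i) $P(T_n = r) = \binom{n}{r} p^r (1 - p)^{n - r}$ where $n = 6$ and $p = \tfrac{1}{3}$, so

$$P(T_6 = r) = \binom{6}{r} \left(\frac{1}{3}\right)^r \left(\frac{2}{3}\right)^{6 - r} = \binom{6}{r} \frac{2^{6 - r}}{729}$$

Hence
$P(T_6 = 0) = \frac{64}{729}$
$P(T_6 = 1) = \frac{192}{729}$
$P(T_6 = 2) = \frac{240}{729}$
$P(T_6 = 3) = \frac{160}{729}$
$P(T_6 = 4) = \frac{60}{729}$
$P(T_6 = 5) = \frac{12}{729}$
$P(T_6 = 6) = \frac{1}{729}$

(ii) $E(X) = 1 \times p + 0 \times (1 - p) = p$
$E(X^2) = 1^2 \times p + 0^2 \times (1 - p) = p$
Therefore,
$\mathrm{var}(X) = p - p^2 = p(1 - p)$

(iii) $T_n = X_1 + X_2 + \cdots + X_n$
so that

$$\frac{T_n}{n} = \frac{X_1 + X_2 + \cdots + X_n}{n} = \bar{X}$$

and we have

$$E\left(\frac{T_n}{n}\right) = E(\bar{X}) = E(X) \quad \text{(from table)}$$
$$= p \quad \text{(from (ii))},$$

whence
$E(T_n) = np$ (see Exercise 21.1.4.3).
Similarly,

$$\mathrm{var}\left(\frac{T_n}{n}\right) = \mathrm{var}(\bar{X}) = \frac{\sigma^2}{n} \quad \text{(from table)}$$
$$= \frac{p(1 - p)}{n}$$

as $\sigma^2 = \mathrm{var}(X) = p(1 - p)$.

Hence

$$\mathrm{var}(T_n) = n^2 \times \frac{p(1 - p)}{n} \quad \text{(see Exercise 21.1.4.3)}$$
$$= np(1 - p)$$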


21.3.2 Estimating an Unknown Probability 


Our intuitive concepts of probability were derived from the idea of relative 
frequency in the long run. Hence to estimate the probability of some event, 
we have merely taken a sequence of independent trials, and worked out the 
relative frequency of occurrences. We are now in a better position to 
appreciate what is really happening, and to explain why longer runs tend 
to give better estimates. 


If X is a random variable representing the number (1 or 0) of occurrences 
of the event in a single trial, then 


$$P(X = 1) = p$$
$$P(X = 0) = 1 - p$$

This gives a distribution in which p is a parameter. Therefore, estimating
this unknown probability is an example of parametric estimation. Hence
we require some suitable sampling statistic.


If $x_1$ is the number of occurrences on the first trial, $x_2$ the number on the
second, and so forth, the full record of results $x_1, x_2, \ldots, x_n$ constitutes
a sample from the distribution of X. The sampling statistic

$$T_n = X_1 + X_2 + \cdots + X_n$$

has mean $E(T_n) = np$ (see Exercise 21.3.1.1), and

$$E\left(\frac{T_n}{n}\right) = p.$$

We should therefore accept $\frac{T_n}{n} = \bar{X}$ as a sampling statistic for estimating p,
and the realization of $\bar{X}$ in our case is

$$\frac{x_1 + x_2 + \cdots + x_n}{n}.$$

In gauging how good an estimator any sampling statistic may be, we look
at the sampling variance of the statistic

$\mathrm{var}(\bar{X}) = \frac{p(1 - p)}{n}$ (see Exercise 21.3.1.1),


so that as n increases, the variance of the sampling statistic decreases, and 
the estimator $\bar{X}$ improves.


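This establishes the mathematics of the situation. Let us take some
numerical examples and see how increases in n affect the sampling
distribution of $\frac{T_n}{n}$.

In general, we have

$$P\left(\frac{T_n}{n} = \frac{r}{n}\right) = P(T_n = r) = \binom{n}{r} p^r (1 - p)^{n - r} = p_r \ \text{(say)}$$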




The following diagrams illustrate the sampling distributions for n =
1, 2, 5, 10 for two values of p. They are bar charts of $p_r$ plotted
against $r/n$, where $P(T_n/n = r/n) = p_r$.


[Figure: bar charts of the sampling distribution of $T_n/n$ for n = 1, 2, 5, 10, one column of charts for each of the two values of p; horizontal axis $r/n$ from 0 to 1]


The important thing to notice about these sampling distributions is the 
way in which, as the sample size increases, the distribution becomes more 
and more concentrated about the “true” value of the parameter p. 
This is a result we expected from our consideration of the variance of
$\frac{T_n}{n}$.
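The concentration of the sampling distribution can also be seen numerically. The following Python sketch computes, for n = 1, 2, 5, 10, the exact variance of $T_n/n$ and the probability that $T_n/n$ lies within a fixed distance of p. The value p = 1/2, the tolerance 1/5 and the function names are assumptions of ours, chosen for illustration; they are not taken from the diagrams.

```python
from fractions import Fraction
from math import comb


def relative_frequency_distribution(n, p):
    """Return {r/n: P(T_n/n = r/n)} for a binomial random variable T_n."""
    return {Fraction(r, n): comb(n, r) * p ** r * (1 - p) ** (n - r)
            for r in range(n + 1)}


def variance(dist):
    """Variance of a distribution given as {value: probability}."""
    mean = sum(x * q for x, q in dist.items())
    return sum(x * x * q for x, q in dist.items()) - mean ** 2


def prob_within(dist, centre, tol):
    """Probability that the value lies within tol of centre (inclusive)."""
    return sum(q for x, q in dist.items() if abs(x - centre) <= tol)


P = Fraction(1, 2)     # assumed value of the parameter
TOL = Fraction(1, 5)   # assumed tolerance

summary = {}
for n in (1, 2, 5, 10):
    dist = relative_frequency_distribution(n, P)
    summary[n] = (variance(dist), prob_within(dist, P, TOL))
```

As n increases, the first entry of each pair in `summary` (the variance, equal to $p(1 - p)/n$) falls, while the second (the probability of landing near p) rises.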


Exercise 1

(i) Go back to your sequence of 0’s and 1’s obtained from card guessing
in Unit 16, and divide up the 500 terms into 50 non-overlapping runs of
10 terms each. For each run, work out the relative frequency of 1’s,
and then plot these 50 relative frequencies in some way to see how they
cluster.

(ii) Repeat this process, but with 20 non-overlapping runs of 25 terms each.
Is the clustering any greater?
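If the sequence of 0's and 1's is available in machine-readable form, the division into runs can be done by a short program. The following Python sketch is one possible way of organizing the calculation. The function names and the short `demo` sequence are hypothetical (your own record from Unit 16 would take its place), and the spread is measured here by the variance of the relative frequencies, which is our choice rather than a prescribed method.

```python
from fractions import Fraction


def relative_frequencies(sequence, run_length):
    """Split a 0/1 sequence into non-overlapping runs and return the
    relative frequency of 1's in each run."""
    if run_length <= 0 or len(sequence) % run_length != 0:
        raise ValueError("the sequence must divide exactly into runs")
    return [Fraction(sum(sequence[i:i + run_length]), run_length)
            for i in range(0, len(sequence), run_length)]


def spread(values):
    """Variance of a list of values about their own mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


# A short made-up sequence, used only to show the calls.
demo = [1, 0, 1, 1, 0, 0, 0, 1]
runs_of_two = relative_frequencies(demo, 2)
runs_of_four = relative_frequencies(demo, 4)
```

For the exercise itself one would call `relative_frequencies(record, 10)` and `relative_frequencies(record, 25)` on the 500-term record and compare the two values of `spread`.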




Postscript 


“And now sits Expectation in the air.” 
William Shakespeare 
Henry V 




M100 - MATHEMATICS FOUNDATION COURSE UNITS

1 Functions
2 Errors and Accuracy
3 Operations and Morphisms
4 Finite Differences
5 NO TEXT
6 Inequalities
7 Sequences and Limits I
8 Computing I
9 Integration I
10 NO TEXT
11 Logic I - Boolean Algebra
12 Differentiation I
13 Integration II
14 Sequences and Limits II
15 Differentiation II
16 Probability and Statistics I
17 Logic II - Proof
18 Probability and Statistics II
19 Relations
20 Computing II
21 Probability and Statistics III
22 Linear Algebra I
23 Linear Algebra II
24 Differential Equations I
25 NO TEXT
26 Linear Algebra III
27 Complex Numbers I
28 Linear Algebra IV
29 Complex Numbers II
30 Groups I
31 Differential Equations II
32 NO TEXT
33 Groups II
34 Number Systems
35 Topology
36 Mathematical Structures


